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Abstract 

In the last twenty-five years (1990-2014), algorithmic advances in integer opti¬ 
mization combined with hardware improvements have resulted in an astonishing 
200 billion factor speedup in solving Mixed Integer Optimization (MIO) prob¬ 
lems. We present a MIO approach for solving the classical best subset selection 
problem of choosing k out of p features in linear regression given n observations. 

We develop a discrete extension of modern first order continuous optimization 
methods to find high quality feasible solutions that we use as warm starts to 
a MIO solver that finds provably optimal solutions. The resulting algorithm (a) 
provides a solution with a guarantee on its suboptimality even if we terminate the 
algorithm early, (b) can accommodate side constraints on the coefficients of the 
linear regression and (c) extends to finding best subset solutions for the least ab¬ 
solute deviation loss function. Using a wide variety of synthetic and real datasets, 
we demonstrate that our approach solves problems with n in the 1000s and p in 
the 100s in minutes to provable optimality, and finds near optimal solutions for n 
in the 100s and p in the 1000s in minutes. We also establish via numerical exper¬ 
iments that the MIO approach performs better than Lasso and other popularly 
used sparse learning procedures, in terms of achieving sparse solutions with good 
predictive power. 
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1 Introduction 


We consider the linear regression model with response vector y nx i, model matrix X = 
[xi,..., x p ] G W ixp , regression coefficients (3 G M pxl and errors e G M nxl : 

y = X/3 + e. 

We will assume that the columns of X have been standardized to have zero means and 
unit ^ 2 -norm. In many important classical and modern statistical applications, it is 
desirable to obtain a parsimonious fit to the data by finding the best fc-feature fit to the 
response y. Especially in the high-dimensional regime with p n, in order to conduct 
statistically meaningful inference, it is desirable to assume that the true regression 
coefficient (3 is sparse or may be well approximated by a sparse vector. Quite naturally, 
the last few decades have seen a flurry of activity in estimating sparse linear models with 
good explanatory power. Central to this statistical task lies the best subset problem [40] 
with subset size k, which is given by the following optimization problem: 

min h\y-X(3\\ 2 2 subject to \\(3\\ 0 < k, (1) 

(3 Z 

where the (pseudo)norm of a vector /3 counts the number of nonzeros in /3 and is given 
by ||/3||o = Y7i=i 1 (A 0), where l(-) denotes the indicator function. The cardinality 
constraint makes Problem (1) NP-hard [41], Indeed, state-of-the-art algorithms to solve 
Problem (1), as implemented in popular statistical packages, like leaps in R, do not 
scale to problem sizes larger than p = 30. Due to this reason, it is not surprising that 
the best subset problem has been widely dismissed as being intractable by the greater 
statistical community. 

In this paper we address Problem (1) using modern optimization methods, specifically 
mixed integer optimization (MIO) and a discrete extension of first order continuous 
optimization methods. Using a wide variety of synthetic and real datasets, we demon¬ 
strate that our approach solves problems with n in the 1000s and p in the 100s in 
minutes to provable optimality, and finds near optimal solutions for n in the 100s and 
p in the 1000s in minutes. To the best of our knowledge, this is the first time that 
MIO has been demonstrated to be a tractable solution method for Problem (1). We 
note that we use the term tractability not to mean the usual polynomial solvability 
for problems, but rather the ability to solve problems of realistic size in times that are 
appropriate for the applications we consider. 

As there is a vast literature on the best subset problem, we next give a brief and selective 
overview of related approaches for the problem. 


2 


Brief Context and Background 

To overcome the computational difficulties of the best subset problem, computationally 
tractable convex optimization based methods like Lasso [49, 17] have been proposed as 
a convex surrogate for Problem (1). For the linear regression problem, the Lagrangian 
form of Lasso solves 

n ^ n |||y — x /3||| + A||/31|i, (2) 

where the i x penalty on /3, i.e., ||/3||i = Yhi IAI shrinks the coefficients towards zero 
and naturally produces a sparse solution by setting many coefficients to be exactly 
zero. There has been a substantial amount of impressive work on Lasso [23, 15, 5, 55, 
32, 59, 19, 35, 39, 53, 50] in terms of algorithms and understanding of its theoretical 
properties—see for example the excellent books or surveys [11, 34, 50] and the references 
therein. 

Indeed, Lasso enjoys several attractive statistical properties and has drawn a significant 
amount of attention from the statistics community as well as other closely related fields. 
Under various conditions on the model matrix X and n,p, /3 it can be shown that 
Lasso delivers a sparse model with good predictive performance [11, 34], In order to 
perform exact variable selection, much stronger assumptions are required [11]. Sufficient 
conditions under which Lasso gives a sparse model with good predictive performance 
are the restricted eigenvalue conditions and compatibility conditions [11]. These involve 
statements about the range of the spectrum of sub-matrices of X and are difficult to 
verify, for a given data-matrix X. 

An important reason behind the popularity of Lasso is its computational feasibility and 
scalability to practical sized problems. Problem (2) is a convex quadratic optimization 
problem and there are several efficient solvers for it, see for example [44, 23, 29]. 

In spite of its favorable statistical properties, Lasso has several shortcomings. In the 
presence of noise and correlated variables, in order to deliver a model with good predic¬ 
tive accuracy, Lasso brings in a large number of nonzero coefficients (all of which are 
shrunk towards zero) including noise variables. Lasso leads to biased regression coef¬ 
ficient estimates, since the U-norm penalizes the large coefficients more severely than 
the smaller coefficients. In contrast, if the best subset selection procedure decides to in¬ 
clude a variable in the model, it brings it in without any shrinkage thereby draining the 
effect of its correlated surrogates. Upon increasing the degree of regularization, Lasso 
sets more coefficients to zero, but in the process ends up leaving out true predictors 
from the active set. Thus, as soon as certain sufficient regularity conditions on the data 
are violated, Lasso becomes suboptimal as (a) a variable selector and (b) in terms of 
delivering a model with good predictive performance. 

The shortcomings of Lasso are also known in the statistics literature. In fact, there 
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is a significant gap between what can be achieved via best subset selection and Lasso: 
this is supported by empirical (for small problem sizes, i.e., p < 30) and theoretical 
evidence, see for example, [46, 58, 38, 31, 56, 48] and the references therein. Some 
discussion is also presented herein, in Section 4. 

To address the shortcomings, non-convex penalized regression is often used to “bridge” 
the gap between the convex i\ penalty and the combinatorial £ 0 penalty [38, 27, 24, 
54, 55, 28, 61, 62, 57, 13]. Written in Lagrangian form, this gives rise to continuous 
non-convex optimization problems of the form: 

|||y — X/3||^ + ^j?(|/3i|;7;A), (3) 

i 

where p(\/3\] 7 ; A) is a non-convex function in f3 with A and 7 denoting the degree of reg¬ 
ularization and non-convexity, respectively. Typical examples of non-convex penalties 
include the minimax concave penalty (MCP), the smoothly clipped absolute deviation 
(SCAD), and penalties (see for example, [27, 38, 62, 24]). There is strong statistical 
evidence indicating the usefulness of estimators obtained as minimizers of non-convex 
penalized problems (3) over Lasso see for example [56, 36, 54, 25, 52, 37, 60, 26]. In a 
recent paper, [60] discuss the usefulness of non-convex penalties over convex penalties 
(like Lasso) in identifying important covariates, leading to efficient estimation strate¬ 
gies in high dimensions. They describe interesting connections between £ 0 regularized 
least squares and least squares with the hard thresholding penalty; and in the process 
develop comprehensive global properties of hard thresholding regularization in terms of 
various metrics. [26] establish asymptotic equivalence of a wide class of regularization 
methods in high dimensions with comprehensive sampling properties on both global 
and computable solutions. 

Problem (3) mainly leads to a family of continuous and non-convex optimization prob¬ 
lems. Various effective nonlinear optimization based methods (see for example [62, 24, 
13, 36, 54, 38] and the references therein) have been proposed in the literature to ob¬ 
tain good local minimizers to Problem (3). In particular [38] proposes Sparsenet, a 
coordinate-descent procedure to trace out a surface of local minimizers for Problem (3) 
for the MCP penalty using effective warm start procedures. None of the existing ap¬ 
proaches for solving Problem (3), however, come with guarantees of how close the 
solutions are to the global minimum of Problem (3). 

The Lagrangian version of (1) given by 

Illy - x/JIIj + A ^ i(ft + 0 ), (4) 

Z=1 

may be seen as a special case of (3). Note that, due to non-convexity, problems (4) 
and (1) are not equivalent. Problem (1) allows one to control the exact level of spar¬ 
sity via the choice of fc, unlike (4) where there is no clear correspondence between A 
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and k. Problem (4) is a discrete optimization problem unlike continuous optimization 
problems (3) arising from continuous noil-convex penalties. 

Insightful statistical properties of Problem (4) have been explored from a theoretical 
viewpoint in [56, 31, 32, 48]. [48] points out that (1) is preferable over (4) in terms of 
superior statistical properties of the resulting estimator. The aforementioned papers, 
however, do not discuss methods to obtain provably optimal solutions to problems (4) 
or (1), and to the best of our knowledge, computing optimal solutions to problems (4) 
and (1) is deemed as intractable. 


Our Approach In this paper, we propose a novel framework via which the best 
subset selection problem can be solved to optimality or near optimality in problems 
of practical interest within a reasonable time frame. At the core of our proposal is a 
computationally tractable framework that brings to bear the power of modern discrete 
optimization methods: discrete first order methods motivated by first order methods 
in convex optimization [45] and mixed integer optimization (MIO), see [4], We do 
not guarantee polynomial time solution times as these do not exist for the best subset 
problem unless P=NP. Rather, our view of computational tractability is the ability of 
a method to solve problems of practical interest in times that are appropriate for the 
application addressed. An advantage of our approach is that it adapts to variants of 
the best subset regression problem of the form: 

nun |||y — X/3||| 

s.t. ||/3|| 0 < k 

A/3 < b, 

where A/3 < b represents polyhedral constraints and q G {1, 2} refers to a least absolute 
deviation or the least squares loss function on the residuals r := y — X/3. 


Existing approaches in the Mathematical Optimization Literature In a sem¬ 
inal paper [30] , the authors describe a leaps and bounds procedure for computing global 
solutions to Problem (1) (for the classical n > p case) which can be achieved with com¬ 
putational effort significantly less than complete enumeration, leaps, a state-of-the-art 
R package uses this principle to perform best subset selection for problems with n > p 
and p < 30. [3] proposed a tailored branch-and-bound scheme that can be applied to 
Problem (1) using ideas from [30] and techniques in quadratic optimization, extending 
and enhancing the proposal of [6]. The proposal of [3] concentrates on obtaining high 
quality upper bounds for Problem (1) and is less scalable than the methods presented 
in this paper. 
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Contributions We summarize our contributions in this paper below: 

1. We use MIO to find a provably optimal solution for the best subset problem. 
Our approach has the appealing characteristic that if we terminate the algorithm 
early, we obtain a solution with a guarantee on its suboptimality. Furthermore, 
our framework can accommodate side constraints on (3 and also extends to finding 
best subset solutions for the least absolute deviation loss function. 

2. We introduce a general algorithmic framework based on a discrete extension of 
modern first order continuous optimization methods that provide near-optimal 
solutions for the best subset problem. The MIO algorithm significantly benefits 
from solutions obtained by the first order methods and problem specific informa¬ 
tion that can be computed in a data-driven fashion. 

3. We report computational results with both synthetic and real-world datasets that 
show that our proposed framework can deliver provably optimal solutions for 
problems of size n in the 1000s and p in the 100s in minutes. For high-dimensional 
problems with n G {50,100} and p G {1000, 2000}, with the aid of warm starts and 
further problem-specific information, our approach finds near optimal solutions 
in minutes but takes hours to prove optimality. 

4. We investigate the statistical properties of best subset selection procedures for 
practical problem sizes, which to the best of our knowledge, have remained largely 
unexplored to date. We demonstrate the favorable predictive performance and 
sparsity-inducing properties of the best subset selection procedure over its com¬ 
petitors in a wide variety of real and synthetic examples for the least squares and 
absolute deviation loss functions. 

The structure of the paper is as follows. In Section 2, we present a brief overview of MIO, 
including a summary of the computational advances it has enjoyed in the last twenty-five 
years. We present the proposed MIO formulations for the best subset problem as well 
as some connections with the compressed sensing literature for estimating parameters 
and providing lower bounds for the MIO formulations that improve their computational 
performance. In Section 3, we develop a discrete extension of first order methods in 
convex optimization to obtain near optimal solutions for the best subset problem and 
establish its convergence properties—the proposed algorithm and its properties may be 
of independent interest. Section 4 briefly reviews some of the statistical properties of the 
best-subset solution, highlighting the performance gaps in prediction error, over regular 
Lasso-type estimators. In Section 5, we perform a variety of computational tests on 
synthetic and real datasets to assess the algorithmic and statistical performances of our 
approach for the least squares loss function for both the classical overdetermined case 
n > p, and the high-dimensional case p n. In Section 6, we report computational 
results for the least absolute deviation loss function. In Section 7, we include our 
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concluding remarks. Due to space limitations, some of the material has been relegated 
to the Appendix. 


2 Mixed Integer Optimization Formulations 

In this section, we present a brief overview of MIO, including the simply astonishing 
advances it has enjoyed in the last twenty-five years. We then present the proposed 
MIO formulations for the best subset problem as well as some connections with the 
compressed sensing literature for estimating parameters. We also present completely 
data driven methods to estimate parameters in the MIO formulations that improve 
their computational performance. 

2.1 Brief Background on MIO 

The general form of a Mixed Integer Quadratic Optimization (MIQO) problem is as 
follows: 

min ck t Q« + ol t a 

s.t. An < b 

oii e {0, l}, Vi g x 

a j £ Vj ^ X, 

where a G M m , A G M fcxm , b G and Q G M mxm (positive semidefmite) are the 
given parameters of the problem; M + denotes the non-negative reals, the symbol < 
denotes element-wise inequalities and we optimize over ot G R m containing both discrete 
(cq,i G X) and continuous (cq,i ^ X) variables, with X C {1, ..., m}. For background 
on MIO see [4]. Subclasses of MIQO problems include convex quadratic optimization 
problems (X = 0), mixed integer (Q = 0 mxm ) and linear optimization problems ( 
X = 0, Q = 0 mxm ). Modern integer optimization solvers such as Gurobi and Cplex 
are able to tackle MIQO problems. 

In the last twenty-five years (1991-2014) the computational power of MIO solvers has 
increased at an astonishing rate. In [7], to measure the speedup of MIO solvers, the 
same set of MIO problems were tested on the same computers using twelve consecu¬ 
tive versions of Cplex and version-on-version speedups were reported. The versions 
tested ranged from Cplex 1.2, released in 1991 to Cplex 11, released in 2007. Each 
version released in these years produced a speed improvement on the previous version, 
leading to a total speedup factor of more than 29,000 between the first and last version 
tested (see [7], [42] for details). Gurobi 1.0, a MIO solver which was first released 
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in 2009, was measured to have similar performance to Cplex 11. Version-on-version 
speed comparisons of successive Gurobi releases have shown a speedup factor of more 
than 20 between Gurobi 5.5, released in 2013, and Gurobi 1.0 ([7], [42]). The com¬ 
bined machine-independent speedup factor in MIO solvers between 1991 and 2013 is 
580,000. This impressive speedup factor is due to incorporating both theoretical and 
practical advances into MIO solvers. Cutting plane theory, disjunctive programming 
for branching rules, improved heuristic methods, techniques for preprocessing MIOs, 
using linear optimization as a black box to be called by MIO solvers, and improved 
linear optimization methods have all contributed greatly to the speed improvements in 
MIO solvers [7]. 

In addition, the past twenty years have also brought dramatic improvements in hard¬ 
ware. Figure 1 shows the exponentially increasing speed of supercomputers over the 
past twenty years, measured in billion floating point operations per second [1]. The 
hardware speedup from 1993 to 2013 is approximately 10 5 - 5 ~ 320, 000. When both 
hardware and software improvements are considered, the overall speedup is approxi¬ 
mately 200 billion! Note that the speedup factors cited here refer to mixed integer linear 
optimization problems, not MIQO problems. The speedup factors for MIQO problems 
are similar. MIO solvers provide both feasible solutions as well as lower bounds to the 
optimal value. As the MIO solver progresses towards the optimal solution, the lower 
bounds improve and provide an increasingly better guarantee of suboptimality, which 
is especially useful if the MIO solver is stopped before reaching the global optimum. In 
contrast, heuristic methods do not provide such a certificate of suboptimality. 



Figure 1: Log of Peak Supercomputer Speed from 1993-2013. 

The belief that MIO approaches to problems in statistics are not practically relevant 
was formed in the 1970s and 1980s and it was at the time justified. Given the astonish¬ 
ing speedup of MIO solvers and computer hardware in the last twenty-five years, the 
mindset of MIO as theoretically elegant but practically irrelevant is no longer justified. 
In this paper, we provide empirical evidence of this fact in the context of the best subset 
selection problem. 




2.2 MIO Formulations for the Best Subset Selection Prob¬ 
lem 

We first present a simple reformulation to Problem (1) as a MIO (in fact a MIQO) 
problem: 

Zi = min |||y — X/3 HI 

P,z 

s.t. -MjjZi < fa < MuZi,i = 1,... ,p 

(5) 

Zi e {o,i},« = 1, - - -,p 

E ^ < fc, 

2=1 

where z G {0,1} P is a binary variable and At u is a constant such that if (3 is a minimizer 
of Problem (5), then Mu > ||/3||oo- If ^ = 1, then |/3j| < Mu and if Zi = 0, then f3i = 0. 
Thus, YZ =i i s an indicator of the number of non-zeros in /3. 

Provided that Mu is chosen to be sufficently large with A iu > ||/3||oc, a solution to 
Problem (5) will be a solution to Problem (1). Of course, Mu is not known a priori, 
and a small value of A iu may lead to a solution different from (1). The choice of M. v 
affects the strength of the formulation and is critical for obtaining good lower bounds 
in practice. In Section 2.3 we describe how to find appropriate values for M.u. Note 
that there are other MIO formulations, presented herein (See Problem (8)) that do not 
rely on a-priori specifications of Aiu- However, we will stick to formulation (5) for the 
time being, since it provides some interesting connections to the Lasso. 

Formulation (5) leads to interesting insights, especially via the structure of the convex 
hull of its constraints, as illustrated next: 

Conv ^ j/3 : |A| < MjjZi, Zi G {0, l},i = 1,... ,p, J2 z i < 

= {(3 ■■ HPHoo < Mu, Mi < Muk} C {(3 : Mi < Muk}. 

Thus, the minimum of Problem (5) is lower-bounded by the optimum objective value 
of both the following convex optimization problems: 

Z 2 := min ^||y - X/3||| subject to ||/3||oo < Mu, ||/3||i < Muk (6) 

(3 Z 

Z 3 :=mm ^11y — X/3111 subject to ||/3||i < Muk, (7) 

where (7) is the familiar Lasso in constrained form. This is a weaker relaxation than 
formulation (6), which in addition to the i\ constraint on f3, has box-constraints con¬ 
trolling the values of the /3ds. It is easy to see that the following ordering exists: 
Z 3 < Z 2 < Zi, with the inequalities being strict in most instances. 
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In terms of approximating the optimal solution to Problem (5), the MIO solver begins by 
first solving a continuous relaxation of Problem (5). The Lasso formulation (7) is weaker 
than this root node relaxation. Additionally, MIO is typically able to significantly 
improve the quality of the root node solution as the MIO solver progresses toward the 
optimal solution. 

To motivate the reader we provide an example of the evolution (see Figure 2) of the 
MIO formulation (8) for the Diabetes dataset [23], with n = 350, p = 64 (for further 
details on the dataset see Section 5). 

Since formulation (5) is sensitive to the choice of Aiu, we consider an alternative MIO 
formulation based on Specially Ordered Sets [4] as described next. 


Formulations via Specially Ordered Sets Any feasible solution to formulation (5) 
will have (1 — Zi)/3i = 0 for every i 6 {1,... ,p}. This constraint can be modeled via 
integer optimization using Specially Ordered Sets of Type 1 [4] (SOS-1). In an SOS-1 
constraint, at most one variable in the set can take a nonzero value, that is 

(1 - = 0 (A, 1 - Zi) : SOS-1, 

for every % — 1,... ,p. This leads to the following formulation of (1): 

min | ||y — X/3||| 

P, z 

s.t. ( j3i , 1 - : SOS-1, i = 1,... ,p 

(8) 

Zi e {0,1 },i = l,...,p 

v 

'Yh Zi <k. 

i =1 

We note that Problem (8) can in principle be used to obtain global solutions to Prob¬ 
lem (1) — Problem (8) unlike Problem (5) does not require any specification of the 
parameter M-u- 
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k = 6 


k = 7 



Time (secs) 


Time (secs) 


Figure 2: The typical evolution of the MIO formulation (8) for the diabetes dataset with 
n = 350 ,p = 64 with k = 6 (left panel) and k = 7 (right panel). The top panel shows the 
evolution of upper and lower bounds with time. The lower panel shows the evolution of the 
corresponding MIO gap with time. Optimal solutions for both the problems are found in a 
few seconds in both examples, but it takes 10-20 minutes to certify optimality via the lower 
bounds. Note that the time taken for the MIO to certify convergence to the global optimum 
increases with increasing k. 


We now proceed to present a more structured representation of Problem (8). Note that 
objective in this problem is a convex quadratic function in the continuous variable /3, 
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which can be formulated explicitly as: 


min § /3 r X T X/3 - (X'y ,(3) + \ ||y||| 

P,Z 

s.t. 1 — Zi): SOS-1, i = l,...,p 

Zi E {0,1}, i — 1,... ,p 

p (9) 

Y J z l <k 
1=1 

—Mu < Pi < Mu,i — l,...,p 

Mi <M e . 

We also provide problem-dependent constants Mu and Me E [0, oo]. Mu provides an 
upper bound on the absolute value of the regression coefficients and Me provides an 
upper bound on the fd-norm of (3. Adding these bounds typically leads to improved 
performance of the MIO, especially in delivering lower bound certificates. In Section 2.3, 
we describe several approaches to compute these parameters from the data. 

We also consider another formulation for (9): 

min l C T C ~ (X'y, P) + \ ||y||| 

P,Z,C 

s.t. C = X/3 

(Pi, 1 - Zi) : SOS-1, i = 1,... ,p 
Zi E {0, l},i = 1,. ..,p 

t*i<k (10) 

1=1 

—Mu < Pi < Mu, i = 1, • • • ,p 
m\i <Me 

—My < Q — -Mu, i — 1, • ■ ■, n 

IICIIi < Mi, 


where the optimization variables are (3 E W, C E M n , z E {0,1} P and Mu, Me, M\j, M.\ E 
[0, cxd] are problem specific parameters. Note that the objective function in formula¬ 
tion (10) involves a quadratic form in n variables and a linear function in p variables. 
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Problem (10) is equivalent to the following variant of the best subset problem: 


nun |||y — X/3||| 

s-t. ||/3||o < k , n v 

Halloo <Mtr,||/3||i<M/ 

\\XPWoo<Mi,\\XP\\i<Mi 

Formulations (9) and (10) differ in the size of the quadratic forms that are involved. The 
current state-of-the-art MIO solvers are better-equipped to handle mixed integer linear 
optimization problems than MIQO problems. Formulation (9) has fewer variables but a 
quadratic form in p variables—we find this formulation more useful in the n > p regime, 
with p in the 100s. Formulation (10) on the other hand has more variables, but involves 
a quadratic form in n variables—this formulation is more useful for high-dimensional 
problems p n, with n in the 100s and p in the 1000s. 

As we said earlier, the bounds on f3 and £ are not required, but if these constraints 
are provided, they improve the strength of the MIO formulation. In other words, 
formulations with tightly specified bounds provide better lower bounds to the global 
optimization problem in a specified amount of time, when compared to a MIO formula¬ 
tion with loose bound specifications. We next show how these bounds can be computed 
from given data. 

2.3 Specification of Parameters 

In this section, we obtain estimates for the quantities Aty, Ah, Af^, M.\ such that 
an optimal solution to Problem (11) is also an optimal solution to Problem (1), and 
vice-versa. 


Coherence and Restricted Eigenvalues of a Model Matrix 


Given a model matrix X, [51] introduced the cumulative coherence function 

p[k] := max max V | (X ? , X*) |, 

\I\=k j0 — 


iei 


where, X,-, j — 1,... ,p represent the columns of X, i.e., features. 


For k — 1, we obtain the notion of coherence introduced in [22, 21] as a measure of 
the maximal pairwise correlation in absolute value of the columns of X: p := p[l] = 
max t/j |(Xj, Xj)|. 
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[16, 14] (see also [11] and references therein) introduced the notion that a matrix X 
satisfies a restricted eigenvalue condition if 

A min (XjX/) > 7] k for every / C {1,... ,p} : \I\ < k, (12) 

where A m ; n (X / / X/) denotes the smallest eigenvalue of the matrix X'jX/. An inequality 
linking fi[k] and rj k is as follows. 

Proposition 1. The following bounds hold : 

(a) [51]: n[k] < n ■ k. 

(b) [21] : rjk > 1 - n[k - 1] > 1 - ji ■ (k - 1). 

The computations of fi[k\ and g k for general k are difficult, while /i is simple to compute. 
Proposition 1 provides bounds for /i[k\ and r/ k in terms of the coherence /i. 


Operator Norms of Submatrices 

The (p , q) operator norm of matrix A is 

II A|| Pi9 := max ||Au|| p . 

|| u ||q = l 

We will use extensively here the (1,1) operator norm. We assume that each column 
vector of X has unit t' 2 -norni. The results derived in the next proposition borrow and 
enhance techniques developed by [51] in the context of analyzing the l\—£ 0 equivalence 
in compressed sensing. 

Proposition 2. For any I C {1,... ,p} with |/| = k we have : 

(a) HX^Xj — iHi,! < At[fc — 1]. 

(b) If the matrix X)X/ is invertible and ||XjX/ — I|| u < 1, then ||(X / 7 X J )- 1 || 1)1 < 

1 

l—fi[k—l] ' 

Proof. See Section A.3. □ 

We note that Part (b) also appears in [51] for the operator norm ||(X 7 X/) -1 || 00i00 . 

Given a set / C {1,... ,p} with |/| = k we let /3j denote the least squares regres¬ 
sion coefficients obtained by regressing y on X/, i.e., (3j = (XjX/^Xjy . If we 
append (3j with zeros in the remaining coordinates we obtain (3 as follows: (3 G 
argming.£. =0j jrf 7 ||y — X/3|||. Note that f3 depends on / but we will suppress the de¬ 
pendence on / for notational convenience. 
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2.3.1 Specification of Parameters in terms of Coherence and Restricted 
Strong Convexity 

Recall that Xj, j = 1 ,,p represent the columns of X; and we will use x ?; , i = 1,... ,n 
to denote the rows of X. As discussed above llXdl = 1. We order the correlations 

l(X.„y)|: 

l(X (1) ,y>| > |(X (2)l y)|... > |(X (p) ,y)|. (13) 

We finally denote by ||xj|| 1:fc the sum of the top k absolute values of the entries of 

Xij,j E {l,2,...,p}. 

Theorem 2.1. For any k > 1 such that /i[k — 1] < 1 any optimal solution (3 to (1) 
satisfies: 


(b) 


(d) 


Il3||i< 

ll@l|oo < 

11X311! < 
I|x3iu < 


i-p[k-i}^ 




mm < — 
Vk 


mm 


^^l (X °’) ,y )l 2 ’v^H y l 

n 

^2 Il x *ll°°ll3l|i> v^Hy|| 2 


%— 1 


max ||xj|| 1:fc ||/3|| c 

2=1,...,71 


Proof. For proof see Section A.4. 


(14) 

(15) 

(16) 
(17) 

□ 


We note that in the above theorem, the upper bound in Part (a) becomes infinite as 
soon as fi[k — 1] > 1. In such a case, we can use purely data-driven bounds by using 
convex optimization techniques, as described in Section 2.3.2. 

The interesting message conveyed by Theorem 2.1 is that the upper bounds on ||/3||i, 
11/31|oo, ||X/3||i and ||X/3||oo, corresponding to the Problem (11) can all be obtained in 
terms of r]k and p[k — 1], quantities of fundamental interest appearing in the analysis 
of regularization methods and understanding how close they are to fo solutions [51, 
22, 21, 16, 14], On a different note, Theorem 2.1 arises from a purely computational 
motivation and quite curiously, involves the same quantities: cumulative coherence and 
restricted eigenvalues. 

Note that the quantities p[k — 1], pk are difficult to compute exactly, but they can be ap¬ 
proximated by Proposition 1 which provides bounds commonly used in the compressed 
sensing literature. Of course, approximations to these quantities can also be obtained 
by using subsampling schemes. 
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2.3.2 Specification of Parameters via Convex Quadratic Optimization 

We provide an alternative purely data-driven way to compute the upper bounds to the 
parameters by solving several simple convex quadratic optimization problems. 


Bounds on /%’s 

For the case n > p, upper and lower bounds on fa can be obtained by solving the 
following pair of convex optimization problems: 

uf : = max fa u~ := min B; 

P P (18) 

s.t. |||y — X/31|| < UB, s.t. |||y — X/3||| < UB, 

for % — 1,... ,p. Above, UB is an upper bound to the minimum of the /c-subset least 
squares problem (1). vfa is an upper bound to fa, since the cardinality constraint 
||/3|| o < k does not appear in the optimization problem. Similarly, u~ is a lower bound to 
fa. The quantity Mfa = max{|uT|, |n)“|} serves as an upper bound to \fa\. A reasonable 
choice for UB is obtained by using the discrete first order methods (Algorithms 1 and 2 
as described in Section 3) in combination with the MIO formulation (8) (for a predefined 
amount of time). Having obtained as described above, we can obtain an upper 
bound to ||/3||oo an d ||/3||i as follows: Aiu = maXjAd^ and ||/3||i < Yli=i where, 
M$ > > ... > M^. 

Similarly, bounds corresponding to Parts (c) and (d) in Theorem 2.1 can be obtained 
by using the upper bounds on ||/3||oo> ||/3||i as described above. 

Note that the quantities uf and u~ are finite when the level sets of the least squares 
loss function are finite. In particular, the bounds are loose when p > n. In the following 
we describe methods to obtain non-trivial bounds on (x,;, f3), for i — 1,..., n that apply 
for arbitrary n, p. 


Bounds on (xj,/3)’s 


We now provide a generic method to obtain upper and lower bounds on the quantities 

(x,:,3): 


vf := max (x i? /3) 

s.t. i||y-X/3||!<UB, 


v i := min (x;,/3) 

s.t. illy — X/31|| < UB, 


(19) 
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for % — 1,..., n. Note that the bounds obtained from (19) are non-trivial bounds for 
both the under-determined n < p and overdetermined cases. The bounds obtained 
from (19) are upper and lower bounds since we drop the cardinality constraint on 
(3. The bounds are finite since for every i £ {1 the quantity (xj,/3) remains 

bounded in the feasible set for Problems (19). 

The quantity v* = max{|u^|, |u~|} serves as an upper bound to |(xj,/3)|. In particular, 
this leads to simple upper bounds on HX/^Hoo < max* v* and ||X/3||i < v * an d can 
be thought of completely data-driven methods to estimate bounds appearing in (16) 
and (17). 

We note that Problems (18) and (19) have nice structure amenable to efficient compu¬ 
tation as we discuss in Section A.l. 


2.3.3 Parameter Specifications from Advanced Warm-Starts 

The methods described above in Sections 2.3.1 and 2.3.2 lead to provable bounds on 
the parameters: with these bounds Problem (11) is provides an optimal solution to 
Problem (1), and vice-versa. We now describe some other alternatives that lead to 
excellent parameter specifications in practice. 

The discrete first order methods described in the following section 3 provide good upper 
bounds to Problem (1). These solutions when supplied as a warm-start to the MIO 
formulation (8) are often improved by MIO, thereby leading to high quality solutions 
to Problem (1) within several minutes. If /3 hyb denotes an estimate obtained from this 
hybrid approach, then M.u := rH/^yJloo with r a multiplier greater than one (e.g., 
r £ {1.1,1.5, 2}) provides a good estimate for the parameter A4u. A reasonable upper 
bound to ||/3||i is kM.u■ Bounds on the other quantities: ||X/3|| i, HX/SHoo can be derived 
by using expressions appearing in Theorem 2.1, with aforementioned bounds on ||/3||i 
and ||3||oo- 

2.3.4 Some Generalizations and Variants 

Some variations and improvements of the procedures described above are presented in 
Section A.2 (appendix). 
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3 Discrete First Order Algorithms 


In this section, we develop a discrete extension of first order methods in convex opti¬ 
mization [45, 44] to obtain near optimal solutions for Problem (1) and its variant for 
the least absolute deviation (LAD) loss function. Our approach applies to the problem 
of minimizing any smooth convex function subject to cardinality constraints. 

We will use these discrete first order methods to obtain solutions to warm start the MIO 
formulation. In Section 5, we will demonstrate how these methods greatly enhance the 
performance of the MIO. 

3.1 Finding stationary solutions for minimizing smooth con¬ 
vex functions with cardinality constraints 

Related work and contributions In the signal processing literature [8, 9] proposed 
iterative hard-thresholding algorithms, in the context of to-regularized least squares 
problems, i.e., Problem (4). The authors establish convergence properties of the algo¬ 
rithm under the assumption that X satisfies coherence [8] or Restricted Isometry Prop¬ 
erty [9] . The method we propose here applies to a larger class of cardinality constrained 
optimization problems of the form (20), in particular, in the context of Problem (1) our 
algorithm and its convergence analysis do not require any form of restricted isometry 
property on the model matrix X. 

Our proposed algorithm borrows ideas from projected gradient descent methods in first 
order convex optimization problems [45] and generalizes it to the discrete optimization 
Problem (20). We also derive new global convergence results for our proposed algo¬ 
rithms as presented in Theorem 3.1. Our proposal, with some novel modifications also 
applies to the non-smooth least absolute deviation loss with cardinality constraints as 
discussed in Section 3.3. 

Consider the following optimization problem: 

nun g(/3) subject to ||/3|| 0 < k , (20) 

where g(/3) > 0 is convex and has Lipschitz continuous gradient: 

l|Vg(/3) — \7g(f3)\\ < £||/3 — /3||. (21) 

The first ingredient of our approach is the observation that when g(/3) = ||/3 — for 
a given c, Problem (20) admits a closed form solution. 
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Proposition 3. If f3 is an optimal solution to the following problem: 

(3 G argmin \\/3 — c\\l , (22) 

m\o<k 

then it can be computed as follows: /3 retains the k largest (in absolute value) elements 
o/c e K p and sets the rest to zero, i.e., if |c(i)| > |c( 2 )| > ... > C( p )|, denote the ordered 
values of the absolute values of the vector c, then: 

o _ f c i> ifie{(l),...,(k)}, 

|^0, otherwise , 

where, is the ith coordinate of (3. We will denote the set of solutions to Problem (22) 
by the notation H fc (c). 

Proof. We provide a proof of this in Section B.2, for the sake of completeness. □ 

Note that, we use the notation “argmin” (Problem (22) and in other places that follow) 
to denote the set of minimizers of the optimization Problem. 

The operator (23) is also known as the hard-thresholding operator [20]— a notion that 
arises in the context of the following related optimization problem: 

f3e argmin ^||/3 — c||l + A||/3|| 0 , (24) 

13 2 

where (3 admits a simple closed form expression given by = c t if \cf > \f\ and 3i — 0 
otherwise, for i — 1,... ,p. 

Remark 1. There is an important difference between the minimizers of Problems (22) 
and (24). For Problem (24), the smallest (in absolute value) non-zero element in /3 
is greater than A in absolute value. On the other hand, in Problem (22) there is no 
lower bound to the minimum (in absolute value) non-zero element of a minimizer. This 
needs to be taken care of while analyzing the convergence properties of Algorithm 1 
(Section 3.2). 

Given a current solution (3, the second ingredient of our approach is to upper bound 
the function g{rj) around g(f3). To do so, we use ideas from projected gradient descent 
methods in first order convex optimization problems [45, 44], 

Proposition 4. ([45, 44]) For a convex function g((3) satisfying condition (21) and 
for any L > £ we have : 

gin) ^ Ql(t (3) := g((3) + ^\\rj - [3\\l + (Vg(f3),rj - [3) (25) 

for all (3 , rj with equality holding at [3 = rj. 
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Applying Proposition 3 to the upper bound Ql(?7, A) in Proposition 4 we obtain 


arg min 

Ql(t 7 ,A) = arg min 

l!»7llo< fc 

ll»7l|o<fc 


= arg min 


ll»7l|o<fc 


=Hfc U - 


L 

77 - 


77 


P-jVg{0) 




0 - jVgW 


1 


(26) 


where H^(-) is defined in (23). In light of (26) we are now ready to present Algorithm 1 
to find a stationary point (see Definition 1) of Problem (20). 


Algorithm 1 
Input: g(/3), L, e. 

Output: A first order stationary solution (3*. 

Algorithm: 

1. Initialize with /3 1 G M p such that ||/3i||o < k. 

2. For m > 1, apply (26) with [3 = f3 m to obtain (3 m+1 as: 

(3 m +1 e H, ^f3 m - —Vg(/3 m )^ (27) 

3. Repeat Step 2 , until || (3 m+1 — f3 m H 2 < e. 

4. Let /3 m \= (/3 m 1 ,... , f3 mp ) denote the current estimate and let / = Supp(/3 m ) := 
{i : (3 mi 7 ^ 0}. Solve the continuous optimization problem: 

min g(/3), (28) 

PA=o, i£i 


and let (3* be a minimizer. 

The convergence properties of Algorithm 1 are presented in Section 3.2. We also 
present Algorithm 2, a variant of Algorithm 1 with better empirical performance. Al¬ 
gorithm 2 modifies Step 2 of Algorithm 1 by using a line search. It obtains r] m G 
H fc (Am - iV^(Am)) and (3 m+1 = A m 77 m +(l-A m )/3m> where A m G arg min A g ( Ar/ m + (1 - A )(3 m ). 

Note that the iterate (3 m in Algorithm 2 need not be fc-sparse (i.e., need not satisfy: 

II Am II o<^)j however, 77 m is fc-sparse (||77 m ||o < k). Moreover, the sequence may not lead 
to a decreasing set of objective values, but it satisfies: g((3 m+1 ) < g(rf m ) ^ g(/3 m ). 
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3.2 Convergence Analysis of Algorithm 1 

In this section, we study convergence properties for Algorithm 1. Before we em¬ 
bark on the analysis, we need to define the notion of first order optimality for Prob¬ 
lem (20). 

Definition 1. Given an L > t, the vector rj G MT is said to be a first order stationary 
point of Problem (20) if ||ry||o < k and it satisfies the following fixed point equation: 

Vg(rj) \ . (29) 


rj G Hfc 77 - - 


1 

L 


Let us give some intuition associated with the above definition. 

Consider rj as in Definition 1. Since ||? 7 ||o A k, it follows that there is a set / C 
{1, ... ,p} such that rji = 0 for all i G / and the size of I c (complement of I) is k. Since 
r] G H fc (77 — ^Vg(r/)) , it follows that for all i ^ /, we have: rji = rfo — f Vig(rf), where, 
Vig(rj) is the ztli coordinate of Vg(rj). It thus follows that: Vig(rj) = 0 for all i £ I. 
Since < 7 ( 77 ) is convex in 77 , this means that rj solves the following convex optimization 
problem: 

min < 7 ( 77 ) s.t. rji — 0,i G I. (30) 

n 

Note however, that the converse of the above statement is not true. That is, if / C 
{ 1 , ... ,p} is an arbitrary subset with \I C \ = k then a solution 777 to the restricted convex 
problem (30) with I = I need not correspond to a first order stationary point. 

Note that any global minimizer to Problem (20) is also a first order stationary point, 
as defined above (see Proposition 7). 

We present the following proposition (for its proof see Section B. 6 ), which sheds light 
on a first order stationary point 77 for which ||77 Ho < k. 

Proposition 5. Suppose rj satisfies the first order stationary condition (29) and ||77 Ho < 

k. Then 77 G arg min g(/3). 

P 

We next define the notion of an e-approximate first order stationary point of Prob¬ 
lem ( 20 ): 

Definition 2. Given an e > 0 and L > £ we say that rj satisfies an e-approximate first 
order optimality condition of Problem (20) i/11 ^711 0 A k and for some fj G H/, (77 — \y g(r 7 )), 
we have ||77 — 77 H 2 < e. 

Before we dive into the convergence properties of Algorithm 1 , we need to introduce 
some notation. Let /3 m = ..., /3 mp ) and l m = (ei,..., e p ) with e 3 = 1, if [J m j 0, 

and ej = 0 , if fi mj — 0, j = 1,... ,p, i.e., l m represents the sparsity pattern of the 
support of f3 m . 
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Suppose, we order the coordinates of f3 m by their absolute values: |/d(i), m | > 1/3(2),m| > 

... > |/3(p),m.|• Note that by definition (27), /3p) jm = 0 for all i > k and m > 2. We 
denote a ktm = \P(k),m\ to be the kill largest (in absolute value) entry in (3 m for all 
m > 2. Clearly if a k)m > 0 then ||/3, m || 0 = k and if a ktm = 0 then ||/3 m || 0 < k. Let 
a k := limsup a^ m and a k := lirninf a^ m . 

ra—>■ oo m^-oo 

Proposition 6. Consider g{(3) and I as defined in (20) and (21). Let (3 m ,m > 1 be 
the sequence generated by Algorithm 1. Then we have : 

(a) For any L > I, the sequence g((3 m ) satisfies 

9(Pm) ~ 9(Pm+ 1) > II Prn+l ~ Pm\\l , (31) 

is decreasing and converges. 

(b) If L> i, then /3 m+1 — (3 m —> 0 as m —>■ oo. 

(c) If L > I and a k > 0 then the sequence l m converges after finitely many iterations, 
i.e., there exists an iteration index M* such that l m = l m+ i for all m > M*. 
Furthermore, the sequence (3 m is bounded and converges to a first order stationary 
point. 

(d) If L > I and a k = 0 then lirninf || Vg(/3 m )||oo = 0. 

m—>oo 

(e) Let L > I, a k = 0 and suppose that the sequence /3 m has a limit point. Then 
9(Pm) t nun g(P). 

Proof. See Section B.l. □ 

Remark 2. Note that the existence of a limit point in Proposition 6, Part (e) is guaran¬ 
teed under fairly weak conditions. One such condition is that sup ({/3 : ||/3|| 0 < k, f(P) < f 0 }) < 
oo, for any finite value fo- In words this means that the k-sparse level sets of the func¬ 
tion g(P) is bounded. 

In the special case where g(P) is the least squares loss function, the above condition 
is equivalent to every k-submatrix (Xj ) of X comprising of k columns being full rank. 

In particular, this holds with probability one when the entries of X are drawn from a 
continuous distribution and k < n. 

Remark 3. Parts (d) and (e) of Proposition 6 are probably not statistically interest¬ 
ing cases, since they correspond to un-regularized solutions of the problem min g(P). 
However, we include them since they shed light on the properties of Algorithm 1. 

The conditions assumed in Part (c) imply that the support of P m stabilizes and Algo¬ 
rithm 1 behaves like vanilla gradient descent thereafter. The support of P m need not 
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stabilize for Parts (d), (e) and thus Algorithm 1 may not behave like vanilla gradi¬ 
ent descent after finitely many iterations. However, the objective values (under minor 
regularity assumptions) converge to min g((3). 

We present the following Proposition (for proof see Section B.3) about a uniqueness 
property of the fixed point equation ( 1 ). 

Proposition 7. Suppose L > £ and let 77 satisfy a first order stationary point as in 
Definition 1. Then the set H/. (77 — j^^g(rj)) has exactly one element: 77 . 

The following proposition (for a proof see Section B.4) shows that a global minimizer 
of the Problem (20) is also a first order stationary point. 

Proposition 8 . Suppose L > i and let (3 be a global minimizer of Problem (20). Then 
(3 is a first order stationary point. 

Proposition 6 establishes that Algorithm 1 either converges to a first order stationarity 
point (part (c)) or it converges 1 to a global optimal solution (Parts (d), (e)), but does 
not quantify the rate of convergence. We next characterize the rate of convergence of 
the algorithm to an e-approximate first order stationary point. 

Theorem 3.1. Let L > i and (3* denote a first order stationary point of Algorithm 1. 
After M iterations Algorithm 1 satisfies 


min ||/3 m+ i - (3 r . 


r < 
I 2 — 


2 Wi)-#)) 

M(L - £) 


(32) 


where g((3 m ) i 9((3*) as m —>■ 00 . 


Proof. See Section B.5. □ 

Theorem 3.1 implies that for any e > 0 there exists M = O(-) such that for some 
1 < m* < M, we have: \\f3 m * +1 — /3 m »||| < e. Note that the convergence rates de¬ 
rived above apply for a large class of problems ( 20 ), where, the function g(/3) > 0 is 
convex with Lipschitz continuous gradient (21). Tighter rates may be obtained under 
additional structural assumptions on g(-). For example, the adaptation of Algorithm 1 
for Problem (4) was analyzed in [ 8 , 9] with X satisfying coherence [ 8 ] or Restricted 
Isometry Property (RIP) [9]. I 11 these cases, the algorithm can be shown to have a 
linear convergence rate [ 8 , 9], where the rate depends upon the RIP constants. 

Note that by Proposition 6 the support of (3 m stabilizes after finitely many iterations, 
after which Algorithm 1 behaves like gradient descent on the stabilized support. If g{(3) 
restricted to this support is strongly convex, then Algorithm 1 will enjoy a linear rate 
of convergence [45], as soon as the support stabilizes. This behavior is adaptive, i.e., 
Algorithm 1 does not need to be modified after the support stabilizes. 

1 under minor technical assumptions 
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The next section describes practical post-processing schemes via which first order sta¬ 
tionary points of Algorithm 1 can be obtained by solving a low dimensional convex 
optimization problem, as soon as the support is found to stabilize, numerically. In our 
numerical experiments, we this version of Algorithm 1 (with multiple starting points) 
took at most a few minutes for p = 2000 and a few seconds for smaller values of p. 


Polishing coefficients on the active set 


Algorithm 1 detects the active set after a few iterations. Once the active set stabilizes, 
the algorithm may take a number of iterations to estimate the values of the regression 
coefficients on the active set to a high accuracy level. 

In this context, we found the following simple polishing of coefficients to be useful. 
When the algorithm has converged to a tolerance of e (~ 10~ 4 ), we fix the current active 
set, X, and solve the following lower-dimensional convex optimization problem: 


min 


g(0)- 


(33) 


In the context of the least squares and the least absolute deviation problems, Prob¬ 
lem (33) reduces to to a smaller dimensional least squares and a linear optimization 
problem respectively, which can be solved very efficiently up to a very high level of 
accuracy. 


3.3 Application to Least Squares 

For the support constrained problem with squared error loss, we have g(f3) = |||y — 
X/31|| and Vg(/3) = — X'(y — X/3). The general algorithmic framework developed 
above applies in a straightforward fashion for this special case. Note that for this case 
£ = A max (X'X). 

The polishing of the regression coefficients in the active set can be performed via a 
least squares problem on y, X/, where I denotes the support of the regression coeffi¬ 
cients. 


3.4 Application to Least Absolute Deviation 

We will now show how the method proposed in the previous section applies to the least 
absolute deviation problem with support constraints in (3: 

nun #i(/3) := ||Y - X/3||i s.t. ||/3|| 0 < k. (34) 


24 


Since gi(/3) is non-smooth, our framework does not apply directly. We smooth the 
non-differentiable gi(/3) so that we can apply Algorithms 1 and 2 . Observing that 
gi(f3) = sup|| w || oo<1 (Y — Xf3,w) we make use of the smoothing technique of [43] to 
obtain gi(/3;r) = sup|| w || oo<1 ((Y — Xf3, w) — |||w|||); which is a smooth approximation 

of g x (/3), with l = Amax [. x ' x) for which Algorithms 1 and 2 apply. 

In order to obtain a good approximation to Problem (34), we found the following 
strategy to be useful in practice: 

1. Fix r > 0, initialize with /3 0 E and repeat the following steps [2]—[3] till 
convergence: 

2. Apply Algorithm 1 (or Algorithm 2) to the smooth function gi(/3]r). Let /3* be 
the limiting solution. 

3. Decrease r ry for some pre-dehned constant 7 = 0.8 (say), and go back to 
step [1] with (3 0 = /3*. Exit if r < TOL, for some pre-dehned tolerance. 


4 A Brief Tour of the Statistical Properties of Prob¬ 
lem (1) 

As already alluded to in the introduction, there is a substantial body of impressive 
work characterizing the theoretical properties of best subset solutions in terms of vari¬ 
ous metrics: predictive performance, estimation of regression coefficients, and variable 
selection properties. For the sake of completeness, we present a brief review of some of 
the properties of solutions to Problem (1) in Section C. 


5 Computational Experiments for Subset Selection 
with Least Squares Loss 

In this section, we present a variety of computational experiments to assess the algo¬ 
rithmic and statistical performances of our approach. We consider both the classical 
overdetermined case with n > p (Section 5.2) and the high dimensional p n case 
(Section 5.3) for the least squares loss function with support constraints. 
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5.1 Description of Experimental Data 

We demonstrate the performance of our proposal via a series of experiments on both 
synthetic and real data. 


Synthetic Datasets. We consider a collection of problems where x, ~ N(0, X),i = 
1,... ,n are independent realizations from a p-dimensional multivariate normal distri¬ 
bution with mean zero and covariance matrix X := (cr^). The columns of the X matrix 
were subsequently standardized to have unit I 2 norm. For a fixed X nxp , we generated 

the response y as follows: y = X/3° + e, where e, ~ lV(0,cr 2 ). We denote the number 
of nonzeros in (3° by k$. The choice of X, f3°,a determines the Signal-to-Noise Ratio 
(SNR) of the problem, which is defined as: SNR = vai ^ 2 ^ \ 

We considered the following four different examples: 

Example 1: We took a l3 = for i,j e {l,...,p} x {1, We consider 

different values of k 0 e {5,10} and /3f = 1 for k 0 equi-spaced values. In the case where 
exactly equi-spaced values are not possible we rounded the indices to the nearest large 
integer value, of i in the range { 1 , 2, ... ,p}. 

Example 2: We took X = I pxp , k 0 = 5 and /3° = (l' 5xl , 0}_ 5xl )' G M p . 

Example 3: We took X = I pXp , k 0 = 10 and /3f — | + (10 — i — 1,..., 10 and 

/3 ? ° = 0, Vi > 10 — i.e., a vector with ten nonzero entries, with the nonzero values being 
equally spaced in the interval [|, 10]. 

Example 4: We took X = I pxp , ko = 6 and (3° = (—10, —6, —2, 2, 6,10, 0 P _6), i.e., a 
vector with six nonzero entries, equally spaced in the interval [—10,10]. 


Real Datasets We considered the Diabetes dataset analyzed in [23]. We used the 
dataset with all the second order interactions included in the model, which resulted in 
64 predictors. We reduced the sample size to n = 350 by taking a random sample and 
standardized the response and the columns of the model matrix to have zero means 
and unit f 2 -norm. 

In addition to the above, we also considered a real microarray dataset: the Leukemia 
data [18]. We downloaded the processed dataset from http: //stat. ethz. ch/~dettling/ 
bagboost.html, which had n = 72 binary responses and more than 3000 predictors. 
We standardized the response and columns of features to have zero means and unit 
^ 2 -norm. We reduced the set of features to 1000 by retaining the features maximally 
correlated (in absolute value) to the response. We call the resulting feature matrix X nxp 
with n = 72, p = 1000. We then generated a semi-synthetic dataset with continuous 


26 




response as y = X/3 + e, where the first five coefficients of /3 were taken as one and 
the rest as zero. The noise was distributed as e* ~ IV(0,(X 2 ), with a 2 chosen to get a 
SNR=7. 


Computer Specifications and Software Computations were carried out in a linux 

64 bit server—Intel(R) Xeon(R) eight-core processor @ 1.80GHz, 16 GB of RAM for 
the overdetermined n > p case and in a Dell Precision T7600 computer with an Intel 
Xeon E52687 sixteen-core processor @ 3.1GHz, 128GB of Ram for the high-dimensional 
p n case. The discrete first order methods were implemented in Matlab 2012b. 
We used Gurobi [33] version 5.5 and the Matlab interface to Gurobi for all of our 
experiments, apart from the computations for synthetic data for n > p, which were 
done in GUROBI via its Python 2.7 interface. 


5.2 The Overdetermined Regime: n > p 

Using the Diabetes dataset and synthetic datasets, we demonstrate the combined effect 
of using the discrete first order methods with the MIO approach. Together, these 
methods show improvements in obtaining good upper bounds and in closing the MIO 
gap to certify global optimality. Using synthetic datasets where we know the true linear 
regression model, we perform side-by-side comparisons of this method with several other 
state-of-the-art algorithms designed to estimate sparse linear models. 


5.2.1 Obtaining Good Upper Bounds 

We conducted experiments to evaluate the performance of our methods in terms of 
obtaining high quality solutions for Problem (1). 

We considered the following three algorithms: 

(a) Algorithm 2 with fifty random initializations 2 . We took the solution corresponding 
to the best objective value. 

(b) MIO with cold start, i.e., formulation (9) with a time limit of 500 seconds. 

(c) MIO with warm start. This was the MIO formulation initialized with the discrete 
first order optimization solution obtained from (a). This was run for a total of 
500 seconds. 

2 we took fifty random starting values around 0 of the form min(i — 1, l)e, i = 1,..., 50, where e ~ 
X(0 px i,4I). We found empirically that Algorithm 2 provided better upper bounds than Algorithm 1. 
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To compare the different algorithms in terms of the quality of upper bounds, we run 
for every instance all the algorithms and obtain the best solution among them, say, /*. 
If /aig denotes the value of the best subset objective function for method “alg”, then 
we define the relative accuracy of the solution obtained by “alg” as: 

Relative Accuracy = (/ alg - /*)//*, (35) 

where alg G {(a), (b), (c)} as described above. 

We did experiments for the Diabetes dataset for different values of k (see Table 1). For 
each of the algorithms we report the amount of time taken by the algorithm to reach 
the best objective value during the time of 500 seconds. 


k 

Discrete First Order 
Accuracy Time 

MIO Cold Start 
Accuracy Time 

MIO Warm Start 
Accuracy Time 

9 

0.1306 

1 

0.0036 

500 

0 

346 

20 

0.1541 

1 

0.0042 

500 

0 

77 

49 

0.1915 

1 

0.0015 

500 

0 

87 

57 

0.1933 

1 

0 

500 

0 

2 


Table 1: Quality of upper bounds for Problem (1) for the Diabetes dataset, for different 
values of k. We see that the MIO equipped with warm starts deliver the best upper bounds 
in the shortest overall times. The run time for the MIO with warm start includes the time 
taken by the discrete first order method (which were all less than a second). 

Using the discrete first order methods in combination with the MIO algorithm resulted 
in finding the best possible relative accuracy in a matter of a few minutes. 


5.2.2 Improving MIO Performance via Warm Starts 

We performed a series of experiments on the Diabetes dataset to obtain a globally 
optimal solution to Problem (1) via our approach and to understand the implications 
of using advanced warm starts to the MIO formulation in terms of certifying optimality. 
For each choice of k we ran Algorithm 2 with fifty random initializations. They took 
less than a few seconds to run. We used the best solution as an advanced warm start to 
the MIO formulation (9). For each of these examples, we also ran the MIO formulation 
without any warm start information and also without the parameter specifications in 
Section 2.3 (we refer to this as “cold start”). Figure 3 summarizes the results. The 
figure shows that in the presence of warm starts and problem specific side information, 
the MIO closes the optimality gap significantly faster. 







k=9 


k=20 


k=31 


k=42 






Figure 3: The evolution of the MIO optimality gap (in log 10 (-) scale) for Problem (1), for 
the Diabetes dataset with n = 350, p = 64 with and without warm starts (and parameter 
specifications as in Section 2.3) for different values of k. The MIO significantly benefits by 
advanced warm starts delivered by Algorithm 2. In all of these examples, the global optimum 
was found within a very small fraction of the total time, but the proof of global optimality 
came later. 

5.2.3 Statistical Performance 

We considered datasets as described in Example 1, Section 5.1— we took different values 
of n, p with n > p, p with ko = 10. 


Competing Methods and Performance Measures For every example, we con¬ 
sidered the following learning procedures for comparison purposes: (a) the MIO ap¬ 
proach equipped warm starts from Algorithm 2 (annotated as “MIO” in the figure), 
(b) the Lasso, (c) Sparsenet and (d) stepwise regression (annotated as “Step” in the 
figure). 

We used R to compute Lasso, Sparsenet and stepwise regression using the glmnet 1.7.3, 
Sparsenet and Stats 3.0.2 packages respectively, which were all downloaded from CRAN 
at http://cran.us.r-proj ect.org/. 

In addition to the above, we have also performed comparisons with a debiased version of 
the Lasso: i.e., performing unrestricted least squares on the Lasso support to mitigate 
the bias imparted by Lasso shrinkage. 

We note that Sparsenet [38] considers a penalized likelihood formulation of the form (3), 
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where the penalty is given by the generalized MCP penalty family (indexed by A, 7 ) for 
a family of values of 7 > 1 and A > 0. The family of penalties used by Sparsenet is thus 
given by: p(t; 7 ; A) = A(|t| — ^)I(|t| < A 7 ) + 4^I(|t| > A 7 ) for 7 , A described as above. 
As 7 = 00 with A fixed, we get the penalty p(t ; 7 ; A) = X\t\. The family above includes 
as a special case (7 = 1 ), the hard thresholding penalty, a penalty recommended in the 
paper [60] for its useful statistical properties. 

For each procedure, we obtained the “optimal” tuning parameter by selecting the model 
that achieved the best predictive performance on a held out validation set. Once the 
model /3 was selected, we obtained the prediction error as: 

Prediction Error = ||X/3 - X/3°|| 1/1|X/3°|| 1. (36) 

We report “prediction error” and number of non-zeros in the optimal model in our 
results. The results were averaged over ten random instances, for different realizations 
of X, e. For every run: the training and validation data had a fixed X but random 
noise e. 

Figure 4 presents results for data generated as per Example 1 with n = 500 and p = 
100. We see that the MIO procedure performs very well across all the examples. Among 
the methods, MIO performs the best, followed by Sparsenet, Lasso with Step(wise) 
exhibiting the worst performance. In terms of prediction error, the MIO performs the 
best, only to be marginally outperformed by Sparsenet in a few instances. This further 
illustrates the importance of using non-convex methods in sparse learning. Note that 
the MIO approach, unlike Sparsenet certifies global optimality in terms of solving 
Problem 1. ffowever, based on the plots in the upper panel, Sparsenet selects a few 
redundant variables unlike MIO. Lasso delivers quite dense models and pays the price 
in predictive performance too, by selecting wrong variables. As the value of SNR 
increases, the predictive power of the methods improve, as expected. The differences 
in predictive errors between the methods diminish with increasing SNR values. With 
increasing values of p (from left panel to right panel in the figure), the number of 
non-zeros selected by the Lasso in the optimal model increases. 

We also performed experiments with the debiased version of the Lasso. The unre¬ 
stricted least squares solution on the optimal model selected by Lasso (as shown in 
Figure 4) had worse predictive performance than the Lasso, with the same sparsity 
pattern. This is probably due to overfitting since the model selected by the Lasso is 
quite dense compared to n,p. We also tried some variants of debiased Lasso which 
led to models with better performances than the Lasso but the results were inferior 
compared to MIO — we provide a detailed description in Section D.2. 

We also performed experiments with n = 1000, p = 50 for data generated as per 
Example 1. We solved the problems to provable optimality and found that the MIO 
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Figure 4: Figure showing the sparsity (upper panel) and predictive performances (bottom 
panel) for different subset selection procedures for the least squares loss. Here, we consider 
data generated as per Example 1, with n = 500, p = 100, ko = 10, for three different SNR 
values with [Left Panel] p = 0.5, [Middle Panel] p = 0.8, and [Right Panel] p = 0.9. The 
dashed line in the top panel represents the true number of nonzero values. For each of the 
procedures, the optimal model was selected as the one which produced the best prediction 
accuracy on a separate validation set, as described in Section 5.2.3. 


performed very well when compared to other competing methods. We do not report 
the experiments for brevity. 


5.2.4 MIO model training 

We trained a sequence of best subset models (indexed by k) by applying the MIO ap¬ 
proach with warm starts. Instead of running the MIO solvers from scratch for different 
values of k, we used callbacks , a feature of integer optimization solvers. Callbacks allow 
the user to solve an initial model, and then add additional constraints to the model 
one at a time. These “cuts” reduce the size of the feasible region without having to 
rebuild the entire optimization model. Thus, in our case, we can save time by building 
the initial optimization model for k = p. Once the solution for k = p is obtained, a cut 
can be added to the model: z t < k for k — p — 1 and the model can be re-solved 
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from this point. We apply this procedure until we arrive at a model with k — 1. 

For each value of k tested, the MIO best subset algorithm was set to stop the first time 
either an optimality gap of 1% was reached or a time limit of 15 minutes was reached. 
Additionally, we only tested values of k from 5 through 25, and used Algorithm 2 to 
warm start the MIO algorithm. We observed that it was possible to obtain speedups 
of a factor of 2-4 by carefully tuning the optimization solver for a particular problem, 
but chose to maintain generality by solving with default parameters. Thus, we do not 
report times with the intention of accurately benchmarking the best possible time but 
rather to show that it is computationally tractable to solve problems to optimality using 
modern MIO solvers. 

5.3 The High-Dimensional Regime: p^> n 

In this section, we investigate (a) the evolution of upper bounds in the high-dimensional 
regime, (b) the effect of a bounding box formulation on the speed of closing the opti¬ 
mality gap and (c) the statistical performance of the MIO approach in comparison to 
other state-of-the art methods. 


5.3.1 Obtaining Good Upper Bounds 

We performed tests similar to those in Section 5.2.1 for the p>n regime. We tested 
a synthetic dataset corresponding to Example 2 with n = 30, p = 2000 for varying 
SNR values (see Table 2) over a time of 500s. As before, using the discrete first order 
methods in combination with the MIO algorithm resulted in finding the best possible 
upper bounds in the shortest possible times. 

We also did experiments on the Leukemia dataset. In Figure 5 we demonstrate the 
evolution of the objective value of the best subset problem for different values of k. For 
each value of k, we warm-started the MIO with the solution obtained by Algorithm 2 
and allowed the MIO solver to run for 4000 seconds. The best objective value obtained 
at the end of 4000 seconds is denoted by /*. We plot the Relative Accuracy, i.e., 
(ft — /*)//*, where ft is the objective value obtained after t seconds. The figure shows 
that the solution obtained by Algorithm 2 is improved by the MIO on various instances 
and the time taken to improve the upper bounds depends upon k. In general, for 
smaller values of k the upper bounds obtained by the MIO algorithm stabilize earlier, 
i.e., the MIO finds improved solutions faster than larger values of k. 
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k 

Discrete First Order 
Accuracy Time 

IVIIO Cold Start 
Accuracy Time 

MIO Warm Start 
Accuracy Time 

co 

5 

0.1647 

37.2 

1.0510 

500 

o 

72.2 

ii 

6 

0.6152 

41.1 

0.2769 

500 

o 

77.1 


7 

0.7843 

40.7 

0.8715 

500 

o 

160.7 

c n 

8 

0.5515 

38.8 

2.1797 

500 

o 

295.8 


9 

0.7131 

45.0 

0.4204 

500 

o 

96.0 

, _ 

5 

0.5072 

45.6 

0.7737 

500 

o 

65.6 

ii 

6 

1.3221 

40.3 

0.5121 

500 

o 

82.3 


7 

0.9745 

40.9 

0.7578 

500 

o 

210.9 

c n 

8 

0.8293 

40.5 

1.8972 

500 

o 

262.5 


9 

1.1879 

44.2 

0.4515 

500 

o 

254.2 


Table 2: The quality of upper bounds for Problem (1) obtained by Algorithm 2, MIO with 
cold start and MIO warm-started with Algorithm 2. We consider the synthetic dataset of 
Example 2 with n = 30, p = 2000 and different values of SNR. The MIO method, when warm- 
started with the first order solution performs the best in terms of getting a good upper bound 
in the shortest time. The metric “Accuracy” is defined in (35). The first order methods are 
fast but need not lead to highest quality solutions on their own. MIO improves the quality 
of upper bounds delivered by the first order methods and their combined effect leads to the 
best performance. 



Figure 5: Behavior of MIO aided with warm start in obtaining good upper bounds over 
time for the Leukemia dataset (n = 72, p = 1000). The vertical axis shows relative accuracy, 
i.e., (ft — /*)//*, where ft is the objective value obtained after t seconds and /* denotes 
the best objective value obtained by the method after 4000 seconds. The colored diamonds 
correspond to the locations where the MIO (with warm start) attains the best solution. The 
figure shows that MIO improves the solution obtained by the first order method in all the 
instances. The time at which the best possible upper bound is obtained depends upon the 
choice of k. Typically larger k values make the problem harder —hence the best solutions are 
obtained after a longer wait. 
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5.3.2 Bounding Box Formulation 


With the aid of advanced warm starts as provided by Algorithm 2, the MIO obtains a 
very high quality solution very quickly—in most of the examples the solution thus ob¬ 
tained turns out to be the global minimum. However, in the typical “high-dimensional” 
regime, with p n, we observe that the certificate of global optimality comes later as 
the lower bounds of the problem “evolve” slowly. This is observed even in the pres¬ 
ence of warm starts and using the implied bounds as developed in Section 2.2 and is 
aggravated for the cold-started MIO formulation (10). 

To address this, we consider the MIO formulation (37) obtained by adding bound¬ 
ing boxes around a local solution. These restrictions guide the MIO in restricting its 
search space and enable the MIO to certify global optimality inside that bounding box. 
We consider the following additional bounding box constraints to the MIO formula¬ 
tion (10): 

{/3 : IIX/3 - X/3J, < 4, oc ) n {/3 : 11/3 - /3J, < <,„} . 

where, (3 0 is a candidate sparse solution. The radii of the two 0-balls above, namely, 
£} loc and £^ loc are user-defined parameters and control the size of the feasible set. 

Using the notation C = X/3 we have the following MIO formulation (equipped with the 
additional bounding boxes): 

min § C T C- (X'y,/3) + § IMI! 

£>,z,C 

s.t. c = X/3 

(A, 1 - Zi) : SOS type-1, i = 1 ,. .. ,p 
Zi e {0,1 },i = l ,...,p 

i=! (37) 

—Mu A /3j < = 1 , ■ ■ ■ ,p 

MU < Me 

-M c u <Ci<M c u ,i = l,...,n 
IICIIi <M\ 

IIC — Colli — ^lioc 

ll/3-/3olli<<ioc- 

For large values of loc (respectively, £|? loc ) the constraints on X/3 (respectively, (3) 
become ineffective and one gets back formulation (10). To see the impact of these addi¬ 
tional cutting planes in the MIO formulation, we consider a few examples as illustrated 
in Figures 6,7,12. 
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Interpretation of the bounding boxes A local bounding box in the variable 
C = X/3 directs the MIO solver to seek for candidate solutions that deliver models 
with predictive accuracy “similar” (controlled by the radius of the ball) to a reference 
predictive model, given by £ 0 . In our experiments, we typically chose Co as the solution 
delivered by running MIO (warm-started with a first order solution) for a few hundred 
to a few thousand seconds. More generally, £ 0 may be selected by any other sparse 
learning method. In our experiments, we found that the run-time behavior of the MIO 
depends upon how correlated the columns of X are — more correlation leading to longer 
run-times. 

Similarly, a bounding box around /3 directs the MIO to look for solutions in the neigh¬ 
borhood of a reference point /3 0 . In our experiments, we chose the reference /3 0 as the 
solution obtained by MIO (warm-started with a first order solution) and allowing it to 
run for a few hundred to a few thousand seconds. We observed that the MIO solver 
in presence of bounding boxes in the /3-space certified optimality and in the process 
finding better solutions; much faster than the (^-bounding box method. 

Note that the /3-bounding box constraint leads to 0(p ) and the £-box leads to O(n) 
constraints. Thus, when p n the additional C constraints add a fewer number of 
extra variables when compared to the /3 constraints. 


Experiments In the first set of experiments, we consider the Leukemia dataset with 
n = 72, p = 1000. We took two different values of k e {5,10} and for each case we ran 
Algorithm 2 with several random restarts. The best solution thus obtained was used 
to warm start the MIO formulation (10), which we ran for an additional 3600 seconds. 
The solution thus obtained is denoted by /3 0 . We then consider formulation (37) with 
d} ioc = oo and different values of d tXoc = Frac (as annotated in Figure 6) — the results 
are displayed in Figure 6. 

We consider another set of experiments to demonstrate the performance of the MIO 
in certifying global optimality for different synthetic datasets with varying n,p,k as 
well as with different structures on the bounding box. In the first case, we generated 
data as per Example 1 with p = 0.9, k 0 = 5. We consider the case with £ 0 = X/3 0 , 
£/?ioc = oo and C\ loc = 0.5||X/3 0 ||i, where f3 0 is a ^-sparse solution obtained from the 
MIO formulation (10) run with a time limit of 1000 seconds, after being warm-started 
with Algorithm 2. The results are displayed in Figure 7[Left Panel], In the second case 
(with data same as before) we obtained /3 0 in the same fashion as described before—we 
took a bounding box around /3 0 , and left the box constraint around X/3 0 inactive, i.e., 
we set d eioc = oo and £) 3 loc = ||/3 0 ||i/fc. We performed two sets of experiments, where 
the data were generated based on different SNR values—the results are displayed in 
Figure 7 with SNR=1 [Middle Panel] and SNR = 3[Right Panel], 
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Leukemia dataset: EfFect of a Bounding Box for MIO formulation (37) 
k = 5 A; = 10 




Figure 6: The effect of the MIO formulation (37) for the Leukemia dataset, for different 
values of k. Here £^ loc = oo and £f loc = Frac. For each value of k, the global minimum 
obtained was the same for the different choices of -Cf loc - 

In the same vein, we have Figure 12 studying the effect of formulations (37) for synthetic 
datasets generated as per Example 1 with n = 50, p = 1000, p = 0.9 and k 0 = 5. 



Figure 7: The effect of the MIO formulation (37) for a synthetic dataset as in Example 1 
with p = 0.9, fco = 5,n = 50,p = 500, for different values of k. [Left Panel] £^ loc = 0.51|X/3 0 1|i 
and C^ loc = oo for a data-set with SNR = 3. [Middle Panel] £^ loc = oo, £^ loc = ||/3 0 ||i/fc 

and SNR = 1. [Right Panel] Cfj ]oc = oo, £^ loc = ||/3 0 ||i/fe and SNR = 3. The figure shows 
that the bounding boxes in terms of X/3 (left-panel) make the problem harder to solve, when 
compared to bounding boxes around (3 (middle and right panels). A possible reason is due 
to the strong correlations among the columns of X. The SNR values do not seem to have a 
big impact on the run-times of the algorithms (middle and right panels). 
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5.3.3 Statistical Performance 


To understand the statistical behavior of MIO when compared to other approaches 
for learning sparse models, we considered synthetic datasets for values of n ranging 
from 30 — 50 and values of p ranging from 1000 — 2000. The following methods were 
used for comparison purposes (a) Algorithm 2. Here we used fifty different random 
initializations around 0, of the form inin(f — 1, l)A^(O px i, 41), i = 1 ,..., 50 and took the 
solution corresponding to the best objective value; (b) The MIO approach with warm 
starts from part (a); (c) The Lasso solution and (d) The Sparsenet solution. 

For methods (a), (b) we considered ten equi-spaced values of k in the range [3, 2k 0 ] 
(including the optimal value of ko)- For each of the methods, the best model was 
selected in the same fashion as described in Section 5.2.3 using separate validation 
sets. 

In addition, for some examples, we also study the performance of the debiased version 
of the Lasso, as described in Section 5.2.3. 

In Figure 8 and Figure 9 we present selected representative results from four different 
examples described in Section 5.1. 




Figure 8: The sparsity and predictive performance for different procedures: [Left Panel] 
shows Example 1 with n = 50, p = 1000,/? = 0.8, ko = 5 and [Right Panel] shows Example 2 
with n = 30, p = 1000—for each instance several SNR values have been shown. 

In Figure 8 the left panel shows the performance of different methods for Example 1 
with n = 50, p = 1000, p = 0.8, ko = 5. In this example, there are five non-zero 
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Figure 9: [Left Panel] Shows performance for data generated according to Example 3 with 
n = 30, p = 1000 and [Right Panel] shows Example 4 with n = 50, p = 2000. 


coefficients: the features corresponding to the non-zero coefficients are weakly correlated 
and a feature having a non-zero coefficient is highly correlated with a feature having 
a zero coefficient. In this situation, the Lasso selects a very dense model since it 
fails to distinguish between a zero and a non-zero coefficient when the variables are 
correlated—it brings both the coefficients in the model (with shrinkage). MIO (with 
warm-start) performs the best—both in terms of predictive accuracy and in selecting 
a sparse set of coefficients. MIO obtains the sparsest model among the four methods 
and seems to find better solutions in terms of statistical properties than the models 
obtained by the first order methods alone. Interestingly, the “optimal model” selected 
by the first order methods is more dense than that selected by the MIO. The number of 
non-zero coefficients selected by MIO remains fairly stable across different SNR values, 
unlike the other three methods. For this example, we also experimented with the 
different versions of debiased Lasso. In summary: the best debiased Lasso models had 
performance marginally better than Lasso but quite inferior to MIO. See the results in 
Appendix, Section D.2 for further details. 

In Figure 8 the right panel shows Example 2, with n = 30, p = 1000, ko = 5 and all 
non-zero coefficients equal one. In this example, all the methods perform similarly in 
terms of predictive accuracy. This is because all non-zero coefficients in (3° have the 
same value. In fact for the smallest value of SNR, the Lasso achieves the best predictive 
model. In all the cases however, the MIO achieves the sparsest model with favorable 
predictive accuracy. 
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In Figure 9, for both the examples, the model matrix is an iid Gaussian ensemble. The 
underlying regression coefficient {3° however, is structurally different than Example 2 
(as in Figure 8, right-panel). The structure in (3° is responsible for different statistical 
behaviors of the four methods across Figures 8 (right-panel) and Figure 9 (both panels). 
The alternating signs and varying amplitudes of (3° are responsible for the poor behavior 
of Lasso. The MIO (with warm-starts) seems to be the best among all the methods. 
For Example 3 (Figure 9, left panel) the predictive performances of Lasso and MIO 
are comparable—the MIO however delivers much sparser models than the Lasso. 

The key conclusions are as follows: 

1. The MIO best subset algorithm has a significant edge in detecting the correct 
sparsity structure for all examples compared to Lasso, Sparsenet and the stand¬ 
alone discrete first order method. 

2. For data generated as per Example 1 with large values of p, the MIO best subset 
algorithm gives better predictive performance compared to its competitors. 

3. For data generated as per Examples 2 and 3, MIO delivers similar predictive 
models like the Lasso, but produces much sparser models. In fact, Lasso seems 
to perform marginally better than MIO, as a predictive model for small values of 
SNR. 

4. For Example 4, MIO performs the best both in terms of predictive accuracy and 
delivering sparse models. 

6 Computational Results for Subset Selection with 
Least Absolute Deviation Loss 

In this section, we demonstrate how our method can be used for the best subset selection 
problem with LAD objective (34). 

Since the main focus of this paper is the least squares loss function, we consider only 
a few representative examples for the LAD case. The LAD loss is appropriate when 
the error follows a heavy tailed distribution. The datasets used for the experiments 
parallel those described in Section 5.1, the difference being in the distribution of e. We 
took 6i iid from a double exponential distribution with variance cr 2 . The value of cr 2 
was adjusted to get different values of SNR. 


Datasets analysed We consider a set-up similar to Example 1 (Section 5.1) with = 
5 and p = 0.9. Different choices of (n,p) were taken to cover both the overdetermined 
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(n = 500, p = 100) and high-dimensional cases (n = 50, p = 1000 and n = 500, p = 

1000 ). 

The other competing methods used for comparison were (a) discrete first order method (Sec¬ 
tion (3.4)) (b) MIO warm-started with the first order solutions and (c) the LAD loss 
with ty regularization: 

min ||y-X/3||i + A||/3||i, 

which we denote by LAD-Lasso. The training, validation and testing were done in the 
same fashion as in the least squares case. For each method, we report the number of 
non-zeros in the optimal model and associated prediction accuracy (36). 




Figure 10: The sparsity and predictive performance for different procedures for n = 500, p = 
100 for Problem (34). The data is generated as per Example 1 with p = 0.9, ko = 5 and double 
exponential errors—further details are available in the text. The acronym “Lasso” refers to 
LAD-Lasso (6). The MIO is seen to deliver sparser models with better predictive accuracy 
when compared to the LAD-Lasso. 

Figure 10 compares the MIO approach with others for LAD in the overdetermined 
case (n > p). Figure 11 does the same for the high-dimensional case (p n). The 
conclusions parallel those for the least squares case. Since, in the example considered, 
the features corresponding to the non-zero coefficients are weakly correlated and a 
feature having a non-zero coefficient is highly correlated with a feature having a zero 
coefficient—the LAD-Lasso selects an overly dense model and misses out in terms 
of prediction error. Both the MIO (with warm-starts) and the discrete first order 
methods behave similarly—much better than i\ regularization schemes. As expected, 
we observed that subset selection with least squares loss leads to inferior models for 
these examples, due to a heavy-tailed distribution of the errors. 

The results in this section are similar to the least squares case. The MIO approach 
provides an edge both in terms of sparsity and predictive accuracy compared to Lasso 
both for the overdetermined and the high-dimensional case. 
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Figure 11: Figure showing the number of nonzero values and predictive performance for 
different values of n andp for Problem (34) (as in Figure 10). [Left panel] has n = 50, p = 1000 
and [Right panel] has n = 500, p = 1000. 

7 Conclusions 

In this paper, we have revisited the classical best subset selection problem of choosing 
k out of p features in linear regression given n observations using a modern optimiza¬ 
tion lens, i.e., MIO and a discrete extension of first order methods from continuous 
optimization. Exploiting the astonishing progress of MIO solvers in the last twenty-five 
years, we have shown that this approach solves problems with n in the 1000s and p 
in the 100s in minutes to provable optimality, and finds near optimal solutions for n 
in the 100s and p in the 1000s in minutes. Importantly, the solutions provided by the 
MIO approach significantly outperform other state of the art methods like Lasso in 
achieving sparse models with good predictive power. LInlikc all other methods, the 
MIO approach always provides a guarantee on its sub-optimality even if the algorithm 
is terminated early. Moreover, it can accommodate side constraints on the coefficients 
of the linear regression and also extends to hireling best subset solutions for the least 
absolute deviation loss function. 

While continuous optimization methods have played and continue to play an important 
role in statistics over the years, discrete optimization methods have not. The evidence 
in this paper as well as in [2] suggests that MIO methods are tractable and lead to 
desirable properties (improved accuracy and sparsity among others) at the expense of 
higher, but still reasonable, computational times. 
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Appendix and Supplementary Material 
A Additional Details for Section 2 

A.l Solving the convex quadratic optimization Problems in 
Section 2.3.2 

We show here that the convex quadratic optimization problems appearing in Sec¬ 
tion 2.3.2 are indeed quite simple and can be solved with small computational cost. 

We first consider Problem (18), the computation of u~ which is a minimization problem. 
We assume without loss of generality that the feasible set of problem (18) is non-empty. 
Thus by standard results in quadratic optimization [10], it follows that, there exists a 
r such that: 

V(d||y-X/3||! + Tft[) =0, 

where, V denotes derivative wrt (3 and a {3 that satisfies the above gradient condition 
must also be feasible for Problem (18). Simplifying the above equation, we get: 

X'X/3 = X'y - re h 

where, e* is a vector in such that its ?'th coordinate is one with the remaining equal 
to zero. Simplifying the above expression, we have 

lly - mil = 11(1 - p x)y + rqiWl- 

Above, I is the identity matrix of size p x p and Px is the familiar projection matrix 
given by X(X'X) _1 X' 3 and g* = X(X'X) _1 ej. Observing that ||y — X/3||| = UB, 
one can readily estimate r that satisfies the above simple quadratic equation. This 
leads to the solution of r, which subsequently leads to the optimal value (3 that solves 
Problem (18). This readily leads to the optimum of Problem (18). 

The above argument readily applies to Problem (18), for the computation of uf by 
writing it as an equivalent minimization problem and observing that: 

~ u t = min ~Pi s -t- ^l|y - ml ^ UB - 

The above derivation can also be adapted to the case of Problem (19). Towards this 
end, notice that for estimating v~ the above steps (for computing w“) will be modified: 

3 Note that we assume here that p > n which typically guarantees that X'X is invertible with 
probability one, provided the entries of X are drawn from a continuous ditribution. 
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ei gets replaced by x, ; e 3? p (the zth row of X); and Px denotes the projection matrix 
onto the column space of X, even if the matrix X'X is not invertible (since here, we 
consider arbitrary n,p). 

In addition, the Problems (18) for the different variables and (19) for the different 
samples; can be solved completely independently, in parallel. 

A.2 Details for Section 2.3.4 

Note that in Problem (10) we consider a uniform bound on fa' s: —A 4u < fa < M.u, for 
all i = 1,... ,p. Note that some of the variables <3, may have larger amplitude than the 
others, thus it may be reasonable to have bounds depending upon the variable index 
i. Thusly motivated, for added flexibility, one can consider the following (adaptive) 
bounds on fa' s: —M.\j < fa < Mfa for i = 1,... ,p. The parameters A if can be taken 
as max{|«t[, |nj“|}, as defined in (18). 

More generally, one can also consider asymmetric bounds on fa as: vfa < fa < uf for 

all i. 

Note that the above ideas for bounding fa's can also be extended to obtain sample- 
specific bounds on (xj, (3) for i — 1,..., n. 

The bounds on ||/3||i and ||X/3||i, HX/SHoo can also be adapted to the above variable 
dependent bounds on /Vs. 

While the above modifications may lead to marginally improved performances, we do 
not dwell much on these improvements mainly for the sake of a clear exposition. 

A.3 Proof of Proposition 2 


Proof 

(a) Given a set /, we define G := X/X/ — I. and let g t j denote the (i, j)th entry of 
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G. For any u e M fc we have 


max ||Gu||i = max > 
" u " 1= i l|u||l=1 Vi^i 


ei 9ij u j 


3 =1 


A: A: 


< 


„max EENI^. 

U 1 = 1 \ — i — 

\*=1 3 =1 

max 1E |mj| E k 


0 =1 i^j 


< max (n[k - l]||u||i) 
Hull 1=1 


= fi[k — 1]. 


( 9jj — 0) 


El^'l - 9[k~l}j 

v. *7U / 


(b) Using XjX/ = I + G and standard power-series convergence (which is valid since 
l|G||i, i < 1 ) we obtain 

°o 1 1 

ll(x;x /) _1 || 1 , 1 = II (I + G )- 1 || M = Y, IIGIlii < — —- < 


i=0 


1 — || G || i ; i 1 — fi[k — 1] 


□ 


A.4 Proof of Theorem 2.1 


Proof 

(a) Since (3 I = (X 7 X/) _1 X 7 y we have 

fl3|| 1 = ||3 7 || 1 <||(X'X / )- 1 || 1 ,i||X' 7 y|| 1 . (38) 


Note that 


k 


ll x /y|li = - /sEl^y)! - E^ x o')>y)l- ( 39 ) 

jai jei j =l 

Applying Part (b) of Proposition 2 and (39) to (38), we obtain (14) . 

(b) We write /3 7 = Ay for A = (X 7 X/) _1 X 7 . If a^, i — 1, ..., k denote the rows of A 
we have: 


llloo 


= max | (a*, y) | < 

A; 


max 


H 2 


(40) 
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For every i — 1,..., k we have 


I a* || 2 < max ||Au|| 2 

||u || 2 = l 

= max ||(X' r X 7 'r 1 X' r u| 


U 2=1 


<A max ((X / / X/) _1 X / / ) 


— max 

d\ 


( 41 ) 


where di,, d\. are the (nonzero) singular values of the matrix X/. To see how 
one arrives at (41) let us denote the singular value decomposition of X/ = UDV' 
with D = diag (g^, d 2 ,..., <4) • We then have 

(X^X/)-^ = (VD^V')(UDV')' = VD^U' 

and the singular values of (XjX/^X^ are thus 1/di, i — 1,... ,k. 

The eigenvalues of XjX/ are df and from (12) we obtain that df > rjk- Using (41) 
we thus obtain 


Substituting the bound (42) to (40) we obtain 


(42) 


|/3j||oo < 




y 2- 


(43) 


Using the notation A = (X^Xj) , we have 

||3/||oo = max |(a,;, X^y)| 
1=1 


< ( max ||a*|| 2 ) HX^y|| 2 
<A max ((XjX/) -1 ) ||X'y|| 2 


= | max — 
i=l,...,k df 


l< x J.y>l : 

j&i 


1 

< — 
Vk 


, Xl< x ti).y>l 2 - 

\ 3=1 


( 44 ) 


Combining (43) and (44) we obtain (15). 
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(c) We have 


n n n 

iix/ 3 /ili < H Ii x *iun 3 /iii = II x ® il °° 1 1 3 /111 • (45) 

2=1 2=1 2=1 

Let P j X/(X / / X/)^ 1 Xj denote the projection onto the columns of X/. We 

have HP/ylb < ||y|| 2 , leading to: 

||X/3/||i = ||P/y||i < V^||P/y|| 2 < V^Hylh, (46) 

where we used that for any a e M m , we have -^/m||a ||2 > ||a||i. Combining (45) 
and (46) we obtain (16). 

(d) For any vector f3 r which has zero entries in the coordinates outside /, we have: 
llX/3/lloo < max ((x*, /3 Z )| < max ||||||1|oo, 


leading to (17). 


□ 


B Proofs and Technical Details for Section 3 

B.l Proof of Proposition 6 


Proof 

(a) Let (3 be a vector satisfying ||/3|| 0 < k. Using the notation rj 6 H fc (/3 — j^V g(/3)) 
we have the following chain of inequalities: 


9(P) = Ql(P,P) 

> „ inf, Ql(v,P) 


~\<k 


= 1 i ,f lf (^WV ~ PWI + {Vg(P),rj - P) + g(P) 

||r?|jo<fc \ Z 


= inf 

lhllo<fc 


L 


v- ( P- j^g(P) 


-^\\Vgm\l + g(P) 


|77 


/ 3 - 1 v 9 (/ 3 )) \\l-P\\Vg(p)\\l + m 


(From (26)) 


= (^\\V~P\\1 + (Vg(P), rj-P)+ g(P)^j 
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(From (25)) 


{~Y^ ^ _ ^ 11^ _ ^Il 2 + (^9(P), V-(3)+ g(P)^j 


> 


I± Y 1 \\V~ P\\l+ (^\\V~ P\\l + (Vs(/3), rj-P)+ g(P)^j 



Qe(v,P) 


> ^Y~ Wv*P\\l + 9(fi)- 


This chain of inequalities leads to: 


g(P) ~g(v) > 


L-e 

—^— h 


2 


V-PW 


2 

2 ' 


(47) 


Applying (47) for P = /3 rn and rj = P m+1 , the vectors generated by Algorithm 1 , 
we obtain (31). This implies that the objective values g(P m ) are decreasing and 
since the sequence is bounded below (g(P) > 0 ), we obtain that g(P m ) converges 
as m —> oo. 

(b) If L > £ and from part (a), the result follows. 

(c) The condition a k > 0 means that for all m sufficiently large, the entry |/3m m | will 
remain (uniformly) bounded away from zero. We will use this to prove that the 
support of P m converges. For the purpose of establishing contradiction suppose 
that the support does not converge. Then, there are infinitely many values of m' 
such that 1 m ' 7 ^ l m '+i- Using the fact that ||/3 m ||o = k for all large m we have 



(48) 


where i , j are such that = Pm',j — 0. As ml —> oo, the quantity in the 

rhs of (48) remains bounded away from zero since a k > 0. This contradicts 
the fact that P m+1 — P m — > 0, as established in part (b). Thus, l m converges, 
and since l m is a discrete sequence, it converges after finitely many iterations, 
that is l m = l m+ i for all m > M*. Algorithm 1 becomes a vanilla gradient 
descent algorithm, restricted to the space l rn for m > M*. Since a gradient 
descent algorithm for minimizing a convex function over a closed convex set leads 
to a sequence of iterates that converge [47, 45], we conclude that Algorithm 1 
converges. Therefore, the sequence P m converges to P*, a first order stationarity 
point: 



53 










(d) Let X m c {1,..., p} denote the set of k largest values of the vector (/3 m — 
in absolute value. By the definition of H/, (/3 m — , we have 



jVgiPr. 


> 



jy g(Pm) 


) 


for all i,j with i G I m and j ^ X m . Thus, 


lim inf min 

m —>oo i£lm 



1 

L 


Vg(Pr 


> lim inf max 

m—>oo j<£ Xm 


Moreover, 



J^giPm) 


(49) 



Hfc (3. 


1 

T 


V^(/3J 




% G X m , 
otherwise. 


Using the fact that (3 m+l — f3 m —> 0 we have 

(Vg((3 m ))i ->0 ,iG X m and /3 mj - -G 0, j ^ X m 


as m —)■ oo. Combining with (49) we have that: 


lim inf min \/3 mi \ > lim inf max — (Vg(/3 m )) 


m—too zGX, 


m—>-00 jtfiTrr 


7 lim inf IIV^JIU 

X/ m—>■ oo 


Since, lim infm^oo rnin ie x m |/3 mi | = = 0 (by hypothesis), the lhs of the above 

inequality equals zero, which leads to lirninfm^oo ||V</(/3 m )||oo = 0. 

(e) We build on the proof of Part (d). 

It follows from equation (49) (by suitably modifying ‘lim inf’ to ‘limsup’) that: 


lim sup min \(3 mi \ > lim sup max 

m—> oo <eZm ra—>• oo j£Zm 




(V 9 (j9 m )) 3 - 


y lim sup ||V 9 (/3 m )|| 0o . 


Note that the lhs of the above inequality is which is zero (by hypothesis), thus 

I|V 9 (/3J|U g 0 as m g oo. 

Suppose (3 oo is a limit point of the sequence f3 m . Thus there is a subsequence 
m! C {1,2,...,} such that (3 m , -G (3^ and g(f3 m ,) —* gi/3^). Using the continuity 
of the gradient and hence the function ■ hg ||V^(-)|| 00 we have that || V(7(/3 m /)|| 00 —* 
||V<7(/3oo)||oo = 0 as ml -G oo. Thus (3^ is a solution to the unconstrained 
(without cardinality constraints) optimization problem min g(/3). Since g{/3 m ) is 
a decreasing sequence, g((3 m ) converges to the minimum of g{(3). □ 
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B .2 Proof of Proposition 3 


Proof: 

We provide a proof of Proposition 3, for the sake of completeness. 

It suffices to consider \c,\ > 0 for all i. Let /3 be an optimal solution to Problem (22) 
and let S := {i : fa ^ 0}. The objective function is given by M 2 + Dies(A -c 0 2 - 
Note that by selecting (3i = c, t for f G S', we can make the objective function Yligs l c *l 2 - 
Thus, to minimize the objective function, S must correspond to the indices of the largest 
k values of |cj|,i > 1. □ 

B .3 Proof of Proposition 7 


Proof 

This follows from Proposition 6, Part (a), which implies that: 

9(v)-g{v) > 

for any f] G H*, (77 — j^'Vg(r])). Now by the definition of H (-), we have g{rj) = g(f]) 
which along with L > i implies that the rhs of the above inequality is zero: thus 
|| ?7 — rj\\ 2 = 0, i.e., 77 = f). Since the choice of 77 was arbitrary, it follows that 77 is the 
only element in the set Pp (77 — jf\/g{r])). □ 

B .4 Proof of Proposition 8 


Proof 

The proof follows by noting that /3 is ^-sparse along with Proposition 6, Part (a), which 
implies that: 


90) ~ g(v) > ^ 


3-77 


2 

2 


for any 77 G H^. \J3 — j-Vg(/3) j. Now, by the definition of (3 we have g((3) = g(fj) which 

along with L > £ implies that the rhs of the above inequality is zero: thus (3 is a first 
order stationary point. □ 
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B .5 Proof of Theorem 3.1 


Proof 

Summing inequalities (31) for 1 <m< M. we obtain 

m L £ M 

^(Pm) ~ 9{P m + 1 )) > WPm+l ~ PmWl, ( 50 ) 

m= 1 m=l 

leading to 

9(Pi) ~9 (Pm+ i ) > ^ min \\P m +i ~ P m \\l- 

Z 771=1,... 

Since the decreasing sequence g(/3 m+1 ) converges to g({3*) by Proposition 6 we have: 


g(Pi) -g(P*) ^ 9(Pi) -g(P 


M 


> 


M+l) 


M 


> 


(L~ 


min \\Pm + i-P‘ 

771=1,...,M 


2 

m II2* 


□ 


B.6 Proof of Proposition 5 


Proof 

If rj is a first order stationary point with \\rj\\ 0 < k, it follows from the argument 
following Definition 1, that there is a set I C {l,...,p} with \I C \ = k such that 
Vig(rj) = 0 for all i I and rj t = 0 for all i ^ I. Let g t := r/* — jfS7ig(rf) for i — 1,... ,p. 
Suppose 4 denotes the set of indices corresponding to the top k ordered values of |/q|. 
Note that: 

Hi = Vi , *6 4 and |^| = |-V j 5 f(r/)|, j ^ I k . (51) 

For i e 4 and j ^ 4 we have /q > /i|. This implies that rq| > | -bVj^7(77) |. Since 
77 e H fc (77 - g(rj)) and || 77 || 0 < k, it follows that 0 = min ie4 \r)i\ = min ie4 |/q|. We 
thus have that V7*7(77) = 0 for all j 4- In addition, note that Vig(rj) = 0 for all 
i e 4- Thus it follows that Vg(rj) = 0 and hence 77 e argmin^ <7(77). □ 

C Brief Review of Statistical Properties for the sub¬ 
set selection problem 

In this section, for the sake of completeness we briefly review some of the properties of 
solutions to Problem (1). 
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Suppose the linear model assumption is true, i.e., y = X/3° + e, with e t ~ N(0,a 2 ). 
Let (3 denote a solution to (1). [46] showed that, with probability greater than 1 — 
exp(— C\k log(p/fc)), the worst case (over /3°) predictive performance has the following 
upper bound: 


max 

/3°:||/3°||o<fc 


1 

n 


X/3° - X/3 


2 


2 k\og(p/k) 

< c 2 cr - 

n 


(52) 


where, Ci,c 2 are universal constants. Similar results also appear in [12, 58]. Interest¬ 
ingly, the upper bound (52) does not depend upon X. Unless p/k = 0(1), the upper 
bound appearing in (52) is of the order 0(a 2 kl ° s( ' p ' > ) where the constants are univer¬ 
sal. In terms of the expected (worst case) predictive risk, an upper bound is given 
by [58]: 


max — E 

/3 0 :||/3 0 || 0 <fc U 


X/3° - X/3 


~k\og(p) 


< a 2 


n 


(53) 


where, the symbol means “<” upto some universal constants. 


A natural question is how do the bounds for Lasso-based solutions compare with (53)? 
In a recent paper [58], the authors derive upper and lower bounds of the prediction 
performance of the thresholded version of the Lasso solution, which we present briefly. 
Suppose 

Pti e argmin ^-\\y - X/3||| + A„||/3||i 
p 2 n 

denotes a Lasso solution for X n = 4Let /3 TL denote the thresholded version of 

the Lasso solution, which retains the top k entries of in absolute value and sets the 
remaining to zero. The bounds on the predictive performances of Lasso based solutions 
depend upon a restricted eigen-value type condition. Following [58], we define, for any 
subset S G {1,2,... ,p}, the quantity: C(S) := {/3|||/3 Sc ||i < 2||/3 5 ||i}, where, ||/3 s ||i = 
\Pj\ an d WMi = Y2jes c \^j\- sa y ^ ie m atrix X satisfies a restricted 
eigen-value type condition with parameter 'y(X) if it satisfies the following: 


4||X/3||| > 7(X)||j 3||1 for f3 € U S: | SM C(S). 

Note that q(X) < 1 and 7 (X) is also related to the so called compatibility condition [11], 
In an insightful paper, [58] show that under such restricted eigenvalue type conditions 
the following holds: 


cr 


k 1 5 log(p) 


7(X 


bad ) 


n 


< max — E 

/3°:||h°l|o<fe n 


X/3° - X/3 


TL 


cr 2 k log(p) 
7 (X) 2 n 


(54) 


In particular, the lower bounds apply to bad design matrices Xbad for some arbitrarily 
small scalar 6 > 0. In fact [58] establish a result stronger than (54), where, /3 TL can be 
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replaced by a fc-sparse estimate delivered by a polynomial time method. The bounds 
displayed in (54) show that there is a significant gap between the predictive performance 
of subset selection procedures (see bound (53)) and Lasso based fc-sparse solutions— 
the magnitude of the gap depends upon how small y(X) is. y(X) can be small if the 
pairwise correlations between the features of the model matrix is quite high. These 
results complement our experimental findings in Section 5. 

An in-depth analysis of the properties of solutions to the Lagrangian version of Prob¬ 
lem (1), namely, Problem (4) is presented in [56]. [46, 56] also analyze the errors in 
the regression coefficients: ||/3° — /3||2, under further minor assumptions on the model 
matrix X. [56, 48] provide interesting theoretical analysis of the variable selection prop¬ 
erties of (1) and (4), showing that subset selection procedures have superior variable 
selection properties over Lasso based methods. 

In passing, we remark that [56] develop statistical properties of inexact solutions to 
Problem (4). This may serve as interesting theoretical support for near global solutions 
to Problem (1), where the certificates of sub-optimality are delivered by our MIO frame¬ 
work in terms of global lower bounds. A precise and thorough understanding of the 
statistical properties of sub-optimal solutions to Problem (1) is left for an interesting 
piece of future work. 


D Additional Details on Experiments and Compu¬ 
tations 

D.l Some additional figures related to the radii of bounding 
boxes 

Some figures illustrating the effect of the bounding box radii are presented in Fig¬ 
ure 12. 


D .2 Lasso, Debiased Lasso and MIO 

We present here comparisons of the debiased Lasso with MIO and Lasso. 

Debiasing is often used to mitigate the shrinkage imparted by the Lasso regularization 
parameter. This is done by performing an unrestricted least squares on the support 
selected by the Lasso. Of course the results will depend upon the tuning parameter 
used for the problem. We use two methods towards this end. In the first method we find 
the best Lasso solution (by obtaining an optimal tuning parameter based on minimizing 
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Evolution of the MIO gap for (37), effect of bounding box radii (n = 50 ,p — 1000). 

4loc = 00 and C hoc = 211/30 111 A 



4oc = 00 and £ f,loc = II A) 111 A 



Time (secs) Time (secs) 


Figure 12: The evolution of the MIO gap with varying radii of bounding boxes for MIO 
formulation (37). The top panel has radii twice the size of the bottom panel. The dataset 
considered is generated as per Example 1 with n = 50, p = 1000, p = 0.9 and ko = 5 for 
different values of SNR: [Left Panel] SNR = 1, [Right Panel] SNR = 3. For each case, 
different values of k have been considered. The top panel has a bounding box radii which is 
twice the corresponding case in the lower panel. As expected, the times for the MIO gaps to 
close depends upon the radii of the boxes. The optimal solutions obtained were found to be 
insensitive to the choice of the bounding box radius. 


predictive error on a held out validation set); we then obtain the un-regularized least 
squares solution for that Lasso solution. This typically performed worse than Lasso 
in all the experiments we tried—see Tables 3 and 4. The unrestricted least squares 
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solution on the optimal model selected by the Lasso (as shown in Figure 4) had worse 
predictive performance than the Lasso, with the same sparsity pattern, as shown in 
Table 3. This is probably due to overfitting since the model selected by the Lasso is 
quite dense compared to n,p. Table 4 presents the results for 50 = n <C p = 1000. 
We consider the same example presented in Figure 9, Example 1. First of all, Table 4 
presents the prediction performance of Lasso after debiasing—we considered the same 
tuning parameter considered optimal for the Lasso problem. We see that as in the case 
of Table 3, the debiasing does not lead to improved performance in terms of prediction 
error. 

We thus experimented with another variant of the debiased Lasso, where for every A 
we computed the Lasso solution (2) and obtained /3 DebA by performing an unrestricted 
least squares fit on the support selected by the Lasso solution at A. This method can 
be thought of delivering feasible solutions for Problem (1), for a value of k := k( A) 
determined by the Lasso solution at A. The success of this method makes a case in 
support of using criterion (1). The tuning parameter was then selected by minimizing 
predictive performance on a held out test validation set. This method in general per¬ 
formed better than Lasso in delivering a sparser model with better predictive accuracy 
than the Lasso. The performance of the debiased Lasso was similar to Sparsenet 
and was in general inferior to MIO by orders of magnitude, especially for the problems 
where the pairwise correlations between the variables was large and SNR was low and 
n -C p. The results are presented in Table 5,6 (for the case n > p) and 7 and 8 (for the 
case n Cp). 


Debiasing at optimal Lasso model, n > p 


SNR 

P 

Ratio: Lasso/ Debiased Lasso 

6.33 

0.5 

0.33 

3.17 

0.5 

0.54 

1.58 

0.5 

0.53 

6.97 

0.8 

0.67 

3.48 

0.8 

0.64 

1.74 

0.8 

0.63 

8.73 

0.9 

1 

4.37 

0.9 

0.58 

2.18 

0.9 

0.61 


Table 3: Lasso and Debiased Lasso corresponding to the numerical experiments of 
4, for Example 1 with n = 500, p = 100, p G {0.5,0.8, 0.9} and ko = 10. Here, “Ratio” 
the ratio of the prediction error of the Lasso and the debiased Lasso at the optimal 
parameter selected by the Lasso. 


Figure 

equals 

tuning 
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Debiasing at optimal Lasso model, n <C p 


SNR 

P 

Ratio: Lasso/Debiased Lasso 

10 

0.8 

0.90 

7 

0.8 

1.0 

3 

0.8 

0.91 


Table 4: Lasso and Debiased Lasso corresponding to the numerical experiments of Figure 
9, for Example 1 with n = 50, p = 1000, p = 0.8 and ko = 5. Here, “Ratio” equals the ratio 
of the prediction error of the Lasso and the debiased Lasso at the optimal tuning parameter 
selected by the Lasso. 


The performance of this model was comparable with Sparsenet—it was better than 
Lasso in terms of obtaining a sparser model with better predictive accuracy. However, 
the performance of MIO was significantly better than the debiased version of the Lasso, 
especially for larger values of p and smaller SNR values. 


Sparsity of Selected Models, n > p 


SNR 

P 

Lasso 

Debiased Lasso 

MIO 

6.33 

0.5 

27.6 (2.122) 

10.9 (0.65) 

10.8 (0.51) 

3.17 

0.5 

27.7 (2.045) 

10.9 (0.65) 

10.1 (0.1) 

1.58 

0.5 

28.0 (2.276) 

10.9 (0.65) 

10.2 (0.2) 

6.97 

0.8 

34.1 (3.60) 

10.4 (0.15) 

10 (0.0) 

3.48 

0.8 

34.0 (3.54) 

10.9 (0.55) 

10.2 (0.2) 

1.74 

0.8 

33.7 (3.49) 

13.7 (1.50) 

10 (0.0) 

8.73 

0.9 

25.9 (0.94) 

13.9 (0.68) 

10.5 (0.17) 

4.37 

0.9 

34.6 (3.23) 

18.1 (1.30) 

10.2 (0.25) 

2.18 

0.9 

34.7 (3.28) 

20.5 (1.85) 

10.1 (0.10) 


Table 5: Number of non-zeros in the selected model by Lasso, Debiased Lasso, and MIO 
corresponding to the numerical experiments of Figure 4, for Example 1 with n = 500, p = 
100, p e {0.5,0.8, 0.9} and ko = 10. The tuning parameters for all three models were selected 
separately based on the best predictive model on a held out validation set. Numbers within 
brackets denote standard-errors. Debiased Lasso leads to less dense models than Lasso. 
When p is small and SNR is large, the model size of debiased Lasso performance is similar 
to MIO. However, for larger values of p and smaller values of SNR subset selection leads to 
orders of magnitude sparser solutions than debiased Lasso. 

We then follow the method described above (for the n > p case), where we consider a 
sequence of models /3 Deb A and find the A that delivers the best predictive model on a 
held out validation set. 
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Predictive Performance of Selected Models, n > p 


SNR 

P 

Lasso 

Debiased Lasso 

MIO 

Ratio: 

Debiased Lasso/MIO 

6.33 

0.5 

0.0384 (0.001) 

0.0255 (0.002) 

0.0266 (0.001) 

1.0 

3.17 

0.5 

0.0768 (0.003) 

0.0511 (0.004) 

0.0478 (0.002) 

1.0 

1.58 

0.5 

0.1540 (0.007) 

0.1021 (0.009) 

0.0901 (0.009) 

1.1 

6.97 

0.8 

0.0389 (0.002) 

0.0223 (0.001) 

0.0231 (0.002) 

1.0 

3.48 

0.8 

0.0778 (0.004) 

0.0464 (0.003) 

0.0484 (0.004) 

1.0 

1.74 

0.8 

0.1557 (0.007) 

0.1156 (0.008) 

0.0795 (0.008) 

1.5 

8.73 

0.9 

0.0325 (0.001) 

0.0220 (0.002) 

0.0197 (0.002) 

1.2 

4.37 

0.9 

0.0632 (0.002) 

0.0532 (0.003) 

0.0427 (0.008) 

1.3 

2.18 

0.9 

0.1265 (0.005) 

0.1254 (0.006) 

0.0703 (0.011) 

1.8 


Table 6: Predictive Performance for tests of Lasso, Debiased Lasso, and MIO correspond¬ 
ing to the numerical experiments of Figure 4, for Example 1 with n = 500, p = 100, p E 
{0.5, 0.8,0.9} and ko = 10. Numbers within brackets denote standard-errors. The tuning 
parameters for all three models were selected separately based on the best predictive model 
on a held out validation set. When p is small and SNR is large, debiased Lasso performance 
is similar to MIO. However, for larger values of p and smaller values of SNR subset selection 
performs better than debiased Lasso based solutions. 


Sparsity of Selected Models, n <C p 


SNR 

P 

Lasso 

Debiased Lasso 

MIO 

10 

0.8 

25.7 (1.73) 

7.9 (0.43) 

5 (0.12) 

7 

0.8 

27.8 (2.69) 

8.1 (0.43) 

5 (0.16) 

3 

0.8 

28.0 (2.72) 

10.0 (0.88) 

6 (1.18) 


Table 7: Number of non-zeros in the selected model by Lasso, Debiased Lasso, and MIO 
corresponding to the numerical experiments of Figure 9, for Example lwith n = 50, p = 
1000, p = 0.8 and ko = 5. Numbers within brackets denote standard-errors. The tuning 
parameters for all three models were selected separately based on the best predictive model 
on a held out validation set. Debiased Lasso leads to less dense models than Lasso but more 
dense models than MIO. The performance gap between MIO and debiased Lasso becomes 
larger with lower values of SNR. 
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Predictive Performance of Selected Models, n <C p 


SNR 

P 

Lasso 

Debiased 

Lasso 

MIO 

Ratio: 

Debiased Lasso/ MIO 

10 

0.8 

0.084 (0.004) 

0.046 (0.003) 

0.014 (0.005) 

3.3 

7 

0.8 

0.122 (0.005) 

0.070 (0.004) 

0.020 (0.007) 

3.5 

3 

0.8 

0.257 (0.012) 

0.185 (0.016) 

0.151 (0.027) 

1.2 


Table 8: Predictive performances of Lasso, Debiased Lasso, and MIO corresponding to 
the numerical experiments of Figure 9, for Example lwith n = 50, p = 1000, p = 0.8 and 
ko = 5. Numbers within brackets denote standard-errors. The tuning parameters for all three 
models were selected separately based on the best predictive model on a held out validation 
set. MIO consistently leads to better predictive models than Debiased Lasso and ordinary 
Lasso. Debiased Lasso performs better than ordinary Lasso. 
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